NUWC-NPT Technical Document 11,138
15  June  1999 


Bayesian  Classification  Using 
Noninformative  Dirichlet  Priors 

Robert  S.  Lynch 

Surface  Undersea  Warfare  Department 


Naval  Undersea  Warfare  Center  Division 
Newport,  Rhode  Island 


Approved  for  public  release;  distribution  is  unlimited. 




PREFACE 


This  document  is  an  adaptation  of  the  author’s  1999 
dissertation  in  electrical  and  systems  engineering  for  the  degree  of 
doctor  of  philosophy  from  the  University  of  Connecticut. 


Reviewed  and  Approved:  15  June  1999 



Patricia J. Dean
Director,  Surface  Undersea  Warfare 


REPORT  DOCUMENTATION  PAGE  Form  Approved 

OMB  No.  0704-0188 




1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE
15 June 1999

3. REPORT TYPE AND DATES COVERED

4. TITLE AND SUBTITLE
Bayesian Classification Using Noninformative Dirichlet Priors

5. FUNDING NUMBERS


6.  AUTHOR(S) 
Robert  S.  Lynch 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Naval Undersea Warfare Center Division
1176 Howell Street
Newport, RI 02841-1708

8. PERFORMING ORGANIZATION REPORT NUMBER
TD 11,138

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

10. SPONSORING/MONITORING AGENCY REPORT NUMBER


11.  SUPPLEMENTARY  NOTES 


Adaptation  of  author’s  doctoral  dissertation. 


12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution is unlimited.

12b. DISTRIBUTION CODE


13.  ABSTRACT  (Maximum  200  words) 

In this dissertation, the Combined Bayes Test (CBT) and its average probability
of error, P(e), are developed. The CBT combines training and test data to infer
symbol probabilities where a Dirichlet (completely noninformative) prior is
assumed for all classes. Using P(e), several results are shown based on the best
quantization complexity, M* (which is related to the Hughes Phenomenon). For
example, it is shown that M* increases with the training and test data. Also, it
is demonstrated that the CBT outperforms a more conventional Maximum Likelihood
(ML) based test, and the Kolmogorov-Smirnov Test (KST). With this, the Bayesian
Data Reduction Algorithm (BDRA) is developed. The BDRA uses P(e) (conditioned on
the training data) and a “greedy” approach for reducing irrelevant features from
each class, and its performance is shown to be superior to that of a neural
network. From here, the CBT is extended to demonstrate performance when the
training data of each class are mislabeled. Performance is shown to degrade when
mislabeling exists in the training data, being dependent on the mislabeling
probabilities. However, it is also shown that the BDRA can be used to diminish
the effect of mislabeling. Further, the BDRA is modified, using two different
approaches, to classify test observations when the training data of each class
contain missing feature values. In the first approach, each missing feature is
assumed to be uniformly distributed over its range of values; in the second
approach, the number of discrete levels for each feature is increased by one.
Both methods of modeling missing features are shown to perform similarly, and
both also outperform a neural network. With these results, the BDRA is applied
to three problems of interest in classification. In the first problem, the BDRA
is applied to training data containing class-specific features; in the second
problem, the BDRA is used to fuse features that have been extracted from
independent sonar echoes. Finally, in the third problem, the BDRA is trained and
tested on the Australian Credit Card Data (ACCD). In all three cases, the BDRA
is shown to improve performance over existing methods.


14. SUBJECT TERMS
Bayesian Theory, Information Theory, Statistics, Probability, Neural Networks,
Pattern Classification

15. NUMBER OF PAGES
122

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT
Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE
Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT
Unclassified

20. LIMITATION OF ABSTRACT
SAR

NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102


TABLE  OF  CONTENTS 


Chapter  1:  Introduction  1 

1.1 Problem Statement and Methodology . 1

1.2  Previous  Related  Research .  5 

1.3  Publications  of  This  Research . 12 

Chapter  2:  The  Combined  Bayes  Test  15 

2.1  Introduction .  15 

2.2  Combined  Information  Classification  .  16 

2.2.1  Combined  Multinomial  Model .  16 

2.2.2  Combined  Bayes  Test .  17 

2.2.3  Probability  of  Error .  19 

2.3  Results .  22 

2.4  Testing  the  Statistical  Similarity  of  Discrete  Data .  29 

2.5  Summary .  32 

Chapter  3:  The  Bayesian  Data  Reduction  Algorithm  33 

3.1  Introduction .  33 

3.2  Development  of  the  BDRA .  34 

3.3  Results .  39 

3.3.1  Performance  at  Reducing  Binary  Valued  Irrelevant  Features  40 

3.3.2  Performance  at  Reducing  Ternary  Valued  Irrelevant  Features  48 


3.4 Summary . 51

Chapter  4:  The  CBT  and  Mislabeled  Training  Data  53 

4.1  Introduction .  53 

4.2  Classification  With  Mislabeled  Training  Data  .  54 

4.2.1  Combined  Multinomial  Model .  54 

4.2.2  Combined  Bayes  Test  (CBT)  .  55 

4.2.3  Probability  of  Error .  57 

4.3  Results .  58 

4.4  Applying  the  BDRA  to  Mislabeled  Training  Data .  62 

4.5  Results  Using  the  BDRA . 64 

4.6  Summary .  67 

Chapter  5:  The  BDRA  and  Missing  Features  69 

5.1  Introduction .  69 

5.2  The  BDRA  Extended  for  Missing  Features .  70 

5.2.1  Method  1 .  70 

5.2.2  Method  2 . 74 

5.3  Results .  74 

5.4 Summary . 77

Chapter 6: Application of the BDRA to Miscellaneous Problems in Classification 79

6.1  Introduction . 79 


6.2 The BDRA Applied to the Selection of Class-Specific Features . 80

6.3 The BDRA Applied to the Fusion of Features From Independent Sonar Echoes . 86

6.4  The  BDRA  Applied  to  the  Australian  Credit  Card  Data .  91 

6.5  Summary .  95 

Appendix  A:  Results  Using  Empirically  Generated  Data  97 

Appendix  B:  Development  of  the  Combined  Bayes  Test  101 


Appendix C: Development of f(y | x_k, H_k) from Probabilistic Considerations 105

Appendix D: Mean and Variance of the Probability of Error for Dirichlet Distributed Symbol Probabilities 107

Bibliography  111 




LIST  OF  TABLES 


1  Threshold  Settings  for  Each  Feature  Before  Applying  the  BDRA  .  89 

2  Initial  Quantization  for  Each  Feature  of  the  ACCD .  92 

3  Performance  of  the  BDRA  and  a  Neural  Network  With  the  ACCD  93 

4  Final  Quantization  for  Each  Feature  of  the  ACCD .  94 




LIST  OF  FIGURES 


1 Representative training sets for two classes . 2

2 Conceptual diagram of combined information classification . 7

3 P(e) for various numbers of discrete symbols M . 22

4 P(e) for the CBT and the SGLRT . 24

5 M* for various training data set sizes . 26

6 Comparison of P(e) using different Dirichlet priors . 27

7 The minimum analytical P(e) for the CBT and the KST . 30

8 Performance of the BDRA with six binary valued features, and twenty five samples of training data for each class . 41

9 Performance comparison of the BDRA to a neural network with binary valued features, and twenty five samples of training data for each class . 42

10 Performance results for the situation shown in Figure 8 with one hundred samples of training data for each class . 45

11 Performance comparison shown in Figure 9 with one hundred samples of training data for each class . 46

12 Estimated number of relevant binary valued features by the BDRA . 48

13 Performance of the BDRA with six ternary valued features, and twenty five samples of training data for each class . 49

14 Performance comparison of the BDRA to a neural network with ternary valued features, and twenty five samples of training data for each class . 50

15 P(e) with various mislabeling probabilities . 59

16 P(e) with more test observations . 61

17 Performance of the BDRA with the mislabeling probabilities of Figure 15 . 65

18 Average number of relevant features reduced, out of a total of six, from the training data of each class . 66

19 Performance comparison of the BDRA to a neural network when a random number of missing features occurs with a probability of 0.15 . 75

20 Performance comparison of Figure 19 repeated using the BDRA and Method 2 . 76

21 Performance of a class-specific classifier with binary valued features, and five samples of training data for each class . 82

22 Performance of Figure 21 repeated with fifty samples of training data for each class . 84

23 Performance of the BDRA with class-specific features when applied to the situation of Figure 21 . 85

24 Target recognition performance comparison of the BDRA to the Chi-square statistic, and an OR detector . 90

25 Simulated performance comparison of the CBT, CGLRT, and the SGLRT where N_y = 2 . 98

26 Simulated performance comparison of the CBT, CGLRT, and the SGLRT where N_y = 25 . 99

27 Simulated performance comparison of the CBT and the KST . 100




Chapter  1 


Introduction 


1.1  Problem  Statement  and  Methodology 

A problem that has received much consideration in the technical literature
involves classification when the statistics (probabilistic models) of each class
are unknown and must be determined empirically (many examples of this can be
found in [1-67]). However, an aspect of this problem that has received little
attention is Bayesian classification of discrete observations when a Dirichlet
(completely noninformative) prior is assumed for the symbol probabilities of
each class. Therefore, the focus of this dissertation is the performance of this
classification method.

By “discrete” it is meant that the data used to represent each class can take on
one of M possible values. These discrete data may have arisen naturally in
M-level form, or they may have been derived by quantizing feature vectors. For
example, three binary valued features can take on M = 2^3 = 8 discrete symbols
corresponding to the eight feature vectors (0, 0, 0), (0, 0, 1), ..., (1, 1, 1).
In this case, the fineness of the proposed quantization is of interest, and an
important aspect of this work is to provide guidance on this issue. As it turns
out, this guidance is a direct result of the ability to place a uniform prior on
discretized feature vectors.
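
The mapping from quantized feature vectors to discrete symbols can be sketched
as follows. This is only an illustration of the counting argument above: the
function name symbol_index and the mixed-radix ordering are assumptions made
here, not notation taken from the dissertation.

```python
from itertools import product
from math import prod


def symbol_index(features, levels):
    """Map a quantized feature vector to one symbol index (mixed-radix code).

    `levels[j]` is the number of discrete levels of feature j (hypothetical
    helper; any one-to-one mapping onto 0..M-1 would serve equally well).
    """
    index = 0
    for value, n_levels in zip(features, levels):
        if not 0 <= value < n_levels:
            raise ValueError("feature value outside its quantization range")
        index = index * n_levels + value
    return index


levels = (2, 2, 2)   # three binary valued features
M = prod(levels)     # number of discrete symbols, 2^3 = 8

# Enumerate every feature vector and the symbol it is assigned to.
symbols = {vec: symbol_index(vec, levels)
           for vec in product(*(range(n) for n in levels))}
```

Under this ordering the vector (0, 0, 0) becomes symbol 0 and (1, 1, 1) becomes
symbol 7; coarser quantization of any feature simply lowers the corresponding
entry of levels and therefore M.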


Figure 1: Representative training sets for two classes.


In the situation of interest there are labeled realizations of these (M-valued)
data, referred to as “training” data, under both classes. That is, there are N_k
realizations under class k and N_l realizations under class l. As an example,
Figure 1 above shows representative histograms of the training data for two
hypothetical classes. In this figure, there are ten samples of training data for
each class, and eight discrete symbols are assumed. It can be seen that the
difference between these two histograms constitutes the relevant classification
information for discriminating between them. Now, given these training data, it
is expected that N_y unlabeled and quantized “test” data are observed, and these
are to be simultaneously tested by a classifier. The goal is to determine, with
minimum probability of error, from which class the unknown test data have been
generated. Conditioned on the active class, all discrete observations of
training and test data are assumed independent. Thus, it is reasonable to
suppose that the observations are governed by an underlying multinomial
distribution, with the parameters of this distribution (the probabilities of
each of the M symbols) unknown and presumably different for each class.

The unknown symbol probabilities for each class can be estimated via maximum
likelihood (ML) in the obvious way: the estimate of the probability of the ith
symbol is the number of observations of type i in the appropriate training data
set, divided by the total number of these training data. Using these estimates,
testing may proceed; performance suffers, however, from singularities caused by
test observations being of types unobserved during training. This type of test
has been referred to as the plug-in (PI) method [46, 52], or the maximum
frequency recognition rule [34]. Here, this test is referred to as a standard
generalized likelihood ratio test (SGLRT), and its performance is discussed in
Chapter 2 and Appendix A. In fact, the correctly posed generalized likelihood
procedure relies on probability estimates drawn from both training and test
observations, and here this is what is known as a combined test. In this case,
the combined generalized likelihood ratio test (CGLRT), which is based on a
combined multinomial distribution, is presented, and its performance discussed,
in Appendix A. Because test data are included in the symbol probability
estimates, the problem of an “unrepresented” test symbol is completely avoided.
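
The singularity described above can be seen in a small numerical sketch. The
training and test symbols below are made up for illustration, and the helper
names are assumptions; the code shows only the plug-in estimate and a pooled
(training plus test) estimate, not the full SGLRT or CGLRT statistics of
Appendix A.

```python
from collections import Counter


def ml_estimate(symbols, M):
    """Relative-frequency (maximum likelihood) estimate of M symbol probabilities."""
    counts = Counter(symbols)
    n = len(symbols)
    return [counts.get(i, 0) / n for i in range(M)]


def combined_ml_estimate(train_symbols, test_symbols, M):
    """Same estimate, but pooled over training and test observations."""
    return ml_estimate(list(train_symbols) + list(test_symbols), M)


def sequence_likelihood(test_symbols, probs):
    """Likelihood of the observed test sequence under fixed symbol probabilities."""
    value = 1.0
    for s in test_symbols:
        value *= probs[s]
    return value


M = 8
train_k = [0, 0, 1, 2, 2, 2, 3, 5, 5, 6]  # ten hypothetical training symbols
test = [2, 7]                             # symbol 7 never occurs in training

p_plug_in = ml_estimate(train_k, M)
p_combined = combined_ml_estimate(train_k, test, M)

# The plug-in estimate assigns probability zero to symbol 7, so the whole
# test sequence has zero likelihood; the pooled estimate does not.
plug_in_likelihood = sequence_likelihood(test, p_plug_in)
combined_likelihood = sequence_likelihood(test, p_combined)
```

With the plug-in estimate every class that never produced symbol 7 in training
is ruled out with certainty, regardless of the other test observations, which
is the failure mode the combined estimate avoids.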

Thus, the approach of this research could have been based on a combined
generalized likelihood framework (i.e., the CGLRT) because of its practical
appeal. From a theoretical standpoint, however, that framework is less
attractive because it lacks optimality in non-asymptotic situations. Therefore,
the approach used here is Bayesian: a uniform prior distribution (prior
information of complete ignorance) is assumed for the symbol probabilities.
Given a prior distribution on all unknown parameters, the hypotheses become
“simple,” and likelihood function based classification is both reasonable and
optimal.

Generally speaking, a uniform prior on unknown parameters is common as a basis
for testing. But the approach often labors under the necessity that the prior be
improper and “diffuse,” or may even be uncertain due to the difficulty of
expressing uniformity in a meaningful way within a complicated parameter space.
In this situation, however, a uniform prior is known (and credited to
Dirichlet), explicit (but not trivial), and denotes probabilities uniformly
distributed over the positive unit hyperplane (i.e., an M dimensional space
whose elements sum to unity). Further, although training and test data from a
common class are statistically independent given the symbol probabilities,
formation of the likelihood function requires integration of their (product)
distribution against the (Dirichlet) prior and hence expresses their dependence.
Such a test, therefore, is combined in its form, and it is referred to in this
work as the Combined Bayes Test (CBT). A further reason for reliance on this
model is that analytic probability of error figures can be calculated, and as
will be seen these can be used as a basis for design.
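
Integrating a multinomial likelihood against a Dirichlet prior has a well known
closed form, the Dirichlet-multinomial predictive distribution. The sketch below
uses that standard result to compare two classes; it assumes equal class priors
and count-vector inputs, and the function names are hypothetical. It is meant as
an illustration of the idea, not as the exact test statistic derived in
Chapter 2 and Appendix B.

```python
from math import exp, lgamma


def log_predictive(y, x, alpha=1.0):
    """Log probability of test counts y given training counts x.

    Symbol probabilities are integrated out against a symmetric
    Dirichlet(alpha) prior; alpha = 1 is the uniform prior.
    """
    M = len(x)
    n_y = sum(y)
    a0 = sum(x) + M * alpha  # total of the posterior Dirichlet parameters
    out = lgamma(n_y + 1) + lgamma(a0) - lgamma(a0 + n_y)
    for xi, yi in zip(x, y):
        out += lgamma(xi + yi + alpha) - lgamma(xi + alpha) - lgamma(yi + 1)
    return out


def classify(y, x_k, x_l):
    """Pick the class whose training counts make the test counts more probable."""
    return "k" if log_predictive(y, x_k) >= log_predictive(y, x_l) else "l"


# Hypothetical training histograms (ten samples each) and two test observations.
x_k = [5, 3, 1, 1]
x_l = [1, 1, 3, 5]
y = [2, 1, 0, 0]
decision = classify(y, x_k, x_l)

# One test observation of symbol 2 after training counts [3, 1, 0, 0]:
# analytically (0 + 1) / (4 + 4).
p_single = exp(log_predictive([0, 0, 1, 0], [3, 1, 0, 0]))
```

Because the training counts enter the predictive probability jointly with the
test counts, a symbol that never appeared in training still receives nonzero
probability, and no separate smoothing step is needed.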

1.2  Previous  Related  Research 

The application of a uniform prior distribution to classifying discrete
observations was previously studied by Hughes [34]. In this work, Hughes showed
that for a fixed amount of training data for each class, and a single test
observation, the average probability of error is minimum at a certain
measurement complexity, M*. In regard to terminology, Hughes actually showed
that the average probability of correct recognition was maximum, but as a
measure of performance quality this is the same as the minimum probability of
error. In addition to being called the best quantization complexity, M* is also
called the best number of discrete symbols. Additionally, the existence of M* is
often referred to as the Hughes phenomenon, and it has appeared in well known
pattern recognition literature such as that by Duda and Hart [23] (also see
Fukunaga [25]). However, Hughes originally developed his result in terms of the
maximum frequency recognition rule (i.e., the SGLRT) instead of the correct
Bayesian decision rule that is based on the Dirichlet distribution (this was
later pointed out in [1, 14]). But, as it turns out, for the case of a single
test observation Hughes’ results are valid because the Bayesian decision rule
and the SGLRT perform identically.
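
The agreement of the two rules for a single test observation can be checked
numerically. The sketch below assumes equal training set sizes and equal class
priors (assumptions made here for the illustration), in which case the
uniform-prior predictive probability (x_i + 1) / (N + M) and the relative
frequency x_i / N always rank the two classes the same way. The histograms and
function names are hypothetical.

```python
from fractions import Fraction


def _decide(p_k, p_l):
    """Return the favored class label, or 'tie' when the scores are equal."""
    if p_k > p_l:
        return "k"
    if p_l > p_k:
        return "l"
    return "tie"


def plug_in_decision(i, x_k, x_l):
    """Maximum frequency rule: compare relative frequencies of symbol i."""
    return _decide(Fraction(x_k[i], sum(x_k)), Fraction(x_l[i], sum(x_l)))


def bayes_decision(i, x_k, x_l):
    """Uniform Dirichlet prior: compare predictive probabilities of symbol i."""
    M = len(x_k)
    return _decide(Fraction(x_k[i] + 1, sum(x_k) + M),
                   Fraction(x_l[i] + 1, sum(x_l) + M))


# Hypothetical training histograms with ten samples per class and M = 8.
x_k = [4, 3, 2, 1, 0, 0, 0, 0]
x_l = [0, 0, 1, 1, 2, 2, 2, 2]

plug_in = [plug_in_decision(i, x_k, x_l) for i in range(len(x_k))]
bayes = [bayes_decision(i, x_k, x_l) for i in range(len(x_k))]
rules_agree = plug_in == bayes
```

With more than one test observation the two rules can differ, since the
plug-in likelihood collapses to zero whenever any test symbol is absent from a
class's training data.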

In the decade following the publication of Hughes’ paper, other papers appeared
suggesting that a best quantization complexity, M*, does not apply. For example,
in [13, 21, 22, 41, 62] it is shown, for any training set size, that the
probability of error approaches zero as the number of independent features
approaches infinity. This represents an apparent contradiction to the results of
Hughes, which show that once the point M* is reached the probability of error
tends to increase with additional feature information. This apparent “paradox”
was later resolved in a paper by Van Campenhout [61]. In this paper, Van
Campenhout points out that the quantity M* is based on incomparable priors,
meaning that each value of M represents a different true prior which cannot be
compared to another prior. However, an alternative and equally valid
interpretation of M*, and one which is of interest here, is that M* represents
the best combination of discrete symbol quantity and number of training data for
estimating symbol probabilities (see [23]). Note that the relationship between
sample size and estimation ability, as they affect performance, has also been
addressed in [26, 32, 39, 57].

A primary contribution of this dissertation is the development of the CBT (see
Chapter 2), which is a generalization of the findings of Hughes to more than one
observation of test data. Therefore, Hughes’ work is revisited and extended by
comparing the performance of the CBT to that of an uncombined SGLRT. In
particular, it is shown that larger numbers of test data cause M* to increase
for the CBT, with an overall reduction in its average probability of error.
However, for the SGLRT larger numbers of test data cause M* to either remain
unchanged or decrease, and its overall average probability of error increases.
With these results, it is also shown that with a slight modification the CBT can
be used to test the statistical similarity of two discrete data sets (i.e.,
whether they were produced by the same multinomial distribution). In this
application, the CBT is shown to have a lower average probability of error than
the more conventional Kolmogorov-Smirnov Test (KST).


Figure  2:  Conceptual  diagram  of  combined  information  classification. 


In the CBT, both training and test data are combined to infer the true symbol
probabilities while, simultaneously, the test data vector is tested for class
membership (alternative approaches to simultaneous detection and estimation can
be found in [3, 6, 36, 49]). To illustrate this, consider Figure 2 above, which
shows a conceptual diagram of combined information classification. By combining
all available data, the CBT is particularly effective at classifying a target
when distributional mismatches exist between the training and test data (for
more on this as applied to speech recognition see [24, 35]; also see Appendix
A). A likely explanation for this effectiveness is that combining training and
test data to infer symbol probabilities implies an adaptation of the test space
[58].

The concept of combining training and test (i.e., labeled and unlabeled) data
to improve classification performance has previously been studied by other
authors. However, results identical to those shown here have not been found. For
example, Merhav and Ephraim [46] (also see [24, 48]) discuss empirical results
of a method (they refer to this as the approximate Bayesian (AB) decision rule)
in which they classify speech signals using Hidden Markov Models. The AB
decision rule is based on the joint (i.e., combined) statistics of the training
and test data. However, only theoretical results based on large sample size
asymptotic situations are provided, whereas the results shown here are presented
for any sample size (particular emphasis is placed on small sample sizes). These
authors also provide a Bayesian decision rule that is credited to Nadas [52].
The result of Nadas closely resembles the CBT except that, as with Hughes’
result, it is only given for a single observation of test data. In another
example, training and test data were combined in [59] (also see [27, 50]) to
estimate the parameters of Gaussian mixtures in an application to remote
sensing. In this case, it was found that the additional test observations
significantly improved overall recognition performance, and an increase in the
quantization complexity was also observed.

The Dirichlet distribution (here, the noninformative version of the Dirichlet
is used) is the conjugate prior of the multinomial distribution, and in this
work it allows the combining of training and test data to infer symbol
probabilities. In general, applying the Dirichlet to the multinomial as its
prior is well known in the Bayesian statistics literature as the
Multinomial-Dirichlet distribution (for example, see [5]). However, in its
typical form the Multinomial-Dirichlet is uncombined in that the data are
represented by a single random variable. Many applications of the
Multinomial-Dirichlet are found in the general statistics literature (for some
examples, see [8, 16, 19, 28, 29, 51, 60]).

As opposed to a Bayesian approach using the Dirichlet distribution, an
alternative approach to classifying with discrete training and test data is to
use a neural network. It is well known in the literature that a neural network
reduces the dimensionality of a training data set by eliminating irrelevant
feature information (for example, see [7, 20]). This has the potential to
improve classification performance for a given training data set. Therefore,
based on the CBT, the Bayesian Data Reduction Algorithm (BDRA) was developed,
and later its performance is compared to that of a neural network (see
Chapter 3).

Development of the BDRA represents another contribution of this work, and in
the results it is demonstrated to be superior overall to a neural network at
reducing a training data set for improved performance. Typically, data reduction
is synonymous with feature selection, and in the literature many different
approaches can be found (for more on this see [37, 40] and references therein;
also see [2, 9, 18, 23, 25, 43, 54, 63, 65]). However, an aspect of the BDRA not
usually found with dimensionality reduction schemes is that it reduces the
quantization by reducing (or, in many cases, removing) any irrelevant features
based on the theoretical Bayes error. Therefore, as opposed to an empirically
based method such as adjusting the weights of a neural network and keeping all
features, with the BDRA it is easier to see which features are important to
correct classification. Additionally, the BDRA has a relatively short training
time, and it does not require a randomized starting configuration.

Additional results found in this dissertation involve extending the CBT (and
the BDRA) to deal with the problems of mislabeled training data (see Chapter 4),
and missing features in the training and test data (see Chapter 5). Each of
these problems is treated independently, and in both cases optimal Bayesian
tests are developed. There appears to be little information on these problems by
other authors, but typical of what can be found in the pattern recognition
literature is the book by Bishop [7]. In this book, the severe effects of
mislabeled data are briefly discussed and related to estimation in the presence
of outliers. Also, with respect to missing features, he describes some of the
techniques used to alleviate the problem. For example, a method often used is to
‘fill in’ missing features with estimates obtained from the known feature values
(e.g., the sample mean). However, these methods can be prone to problems. With
these results, the BDRA is further extended and applied to reducing the training
data when it contains class-specific features (see Chapter 6); most of the
related work on this subject can be found in [2]. In this case, the BDRA is
shown to be an effective method of selecting ad hoc class-specific features, and
it also outperforms the class-specific classifier.

In the literature, the utility of classification methods is often measured by
how well they perform with real data (see, for example, [9, 24, 26, 31, 32, 35,
46, 54, 59, 63, 65, 66]). Therefore, in addition to the various simulated
results appearing in this dissertation, two applications are shown in which the
BDRA is used with real data (see Chapter 6). The first application involves
using the BDRA to fuse features from sonar echoes generated by independent
continuous wave (CW) and frequency modulated (FM) waveforms. In this case, the
sonar echoes were gathered during several at-sea experiments, and they typically
are used to detect and track surface ships and submarines. In the literature
there does not appear to be another application involving sonar data and an
algorithm like the BDRA, and as it turns out, the BDRA performs well at fusing
the data for improved target recognition performance. In the second application,
the BDRA is used to classify the Australian Credit Card Data (ACCD). The ACCD is
based on the actual credit history of 690 applicants, and performance results
with other algorithms applied to these data have appeared in [63, 66]. Relative
to these other algorithms, the probability of error for the BDRA is comparable
to the best that has been achieved with the ACCD.




1.3  Publications  of  This  Research 


The  following  items  list  all  current  publications  produced  by  this  research 
including  two  patent  applications. 

1.  R.  Lynch  and  P.  Willett,  “Classification  With  a  Combined  Information 
Test,”  Proceedings  of  the  IEEE  International  Conference  on  Acoustics, 
Speech,  and  Signal  Processing ,  May  1996. 

2.  R.  S.  Lynch,  Jr.  and  P.  K.  Willett,  “Classification  System  and  Method 
Using  Combined  Information  Testing,”  Application  for  US  patent ,  Navy 
Case  No.  77879. 

3.  R.  Lynch  and  P.  Willett,  “Discrete  Symbol  Quantity  and  the  Minimum 
Probability  of  Error  for  a  Combined  Information  Classification  Test,”  Pro¬ 
ceedings  of  the  35th  Annual  Allerton  Conference  on  Communication,  Con¬ 
trol,  and  Computing ,  September  1997. 

4.  R.  S.  Lynch,  Jr.  and  P.  K.  Willett,  “Bayesian  Classification  and  Data 
Driven  Quantization  Using  Dirichlet  Priors,”  Proceedings  of  the  32nd  Annual 
Conference  on  Information  Sciences  and  Systems ,  March  1998. 

5.  R.  S.  Lynch,  Jr.  and  P.  K.  Willett,  “Testing  the  Statistical  Similarity 
of  Discrete  Observations  Using  Dirichlet  Priors,”  Proceedings  of  the  1998 
IEEE  International  Symposium  on  Information  Theory ,  August  1998. 


12 


6. R. S. Lynch, Jr. and P. K. Willett, “Bayesian Classification and the
Reduction of Irrelevant Features From Training Data,” Proceedings of the 37th
IEEE Conference on Decision and Control, December 1998.

7. R. S. Lynch, Jr. and P. K. Willett, “Bayesian Classification and Discrete
Symbol Quantity When the Training Data are Mislabeled,” Proceedings of
the 1999 IEEE Information Theory Workshop on Detection, Estimation,
Classification and Imaging, February 1999.

8. R. S. Lynch, Jr. and P. K. Willett, “Classification Using Dirichlet Priors
When the Training Data are Mislabeled,” Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing, March
1999.

9. R. S. Lynch, Jr. and P. K. Willett, “Bayesian Classification Using
Mislabeled Training Data and a Noninformative Prior,” To appear as an article
in a Summer 1999 issue of the Journal of the Franklin Institute.

10. R. S. Lynch, Jr. and P. K. Willett, “A Data Reduction System for
Improving Classifier Performance,” Application for US patent, Navy Case No.
79550.




11. R. S. Lynch, Jr. and P. K. Willett, “Performance Considerations for a
Combined Information Classification Test Using Dirichlet Priors,” To
appear as a correspondence in the June 1999 issue of the IEEE Transactions
on Signal Processing.

12. R. S. Lynch, Jr., “Target Detection Performance by Fusing Information
from Tracks Generated by Independent Waveforms,” To appear in the
Proceedings of the 2nd International Conference on Information Fusion.

13. R. S. Lynch, Jr. and P. K. Willett, “A Bayesian Approach to the
Missing Features Problem in Classification,” Submitted for publication in the
Proceedings of the 38th IEEE Conference on Decision and Control.

14. R. S. Lynch, Jr. and P. K. Willett, “A Bayesian Approach to Feature
Selection Using Noninformative Dirichlet Priors,” Submitted to the IEEE
Transactions on Systems, Man, and Cybernetics.




Chapter  2 


The  Combined  Bayes  Test 


2.1  Introduction 

In this chapter, the Combined Bayes Test (CBT) and a formula for its average
probability of error, P(e), are developed. The CBT combines training and test
data to infer discrete symbol probabilities, where a Dirichlet (completely
noninformative) prior is assumed for all classes. Based on the formula for P(e), it
is demonstrated that for fixed training and test data set sizes, P(e) reaches a
minimum at a particular quantization complexity called M*. Note that the
quantity M* is related to the Hughes phenomenon of pattern recognition [34].
It is also shown that M* increases as the training or test
data set sizes increase. Further, to show its effectiveness as a classifier, P(e)
for the CBT is compared to that of the standard generalized likelihood ratio
test (SGLRT). While the main body of this chapter is concerned with detailing




the  classification  performance  of  the  CBT,  another  application  of  this  method  is 
described  where  it  is  used  to  determine  if  two  populations  of  discrete  data  are 
statistically  similar.  In  this  case,  performance  of  the  CBT  is  compared  to  that  of 
the  Kolmogorov-Smirnov  test  (KST). 

2.2  Combined  Information  Classification 

2.2.1  Combined  Multinomial  Model 

With this model, it is assumed that there exists a pair of probability vectors,
$\mathbf{p}_k$ and $\mathbf{p}_l$, the $i$th elements of which denote the probability of a symbol of type
$i$ being observed under the respective classes $k$ and $l$. The fundamental model
for this testing method is thus formulated based on the number of occurrences of
each discrete symbol being an i.i.d. multinomially distributed random variable.
Therefore, the joint distribution for the frequency of occurrence of all training
and test data, with the test data, $\mathbf{y}$, a member of class $k$, is given by

$$ f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid \mathbf{p}_k, \mathbf{p}_l, H_k) = N_k!\, N_l!\, N_y! \prod_{i=1}^{M} \frac{p_{k,i}^{\,x_{k,i}+y_i}\; p_{l,i}^{\,x_{l,i}}}{x_{k,i}!\; x_{l,i}!\; y_i!} \tag{1} $$


where  (in  the  following  notation  k  and  l  are  exchangeable,  and  in  this  and  subse¬ 
quent  chapters  a  boldface  font  is  used  to  indicate  vector  quantities,  while  upper 
case  is  used  for  matrices) 


$k, l \in \{\text{class 1}, \text{class 2}\}$, and $k \neq l$;

$H_k$ is the hypothesis defined as $\mathbf{p}_y = \mathbf{p}_k$;

$M$ is the number of discrete symbols;

$x_{k,i}$ is the number of occurrences of the $i$th symbol in the training data for class $k$;

$N_k$ $\{N_k = \sum_{i=1}^{M} x_{k,i}\}$ is the total number of training data for class $k$;

$y_i$ is the number of occurrences of the $i$th symbol in the test data;

$N_y$ $\{N_y = \sum_{i=1}^{M} y_i\}$ is the total number of test data;

$p_{k,i}$ $\{\sum_{i=1}^{M} p_{k,i} = 1\}$ is the probability of the $i$th symbol for class $k$.

Note, a key assumption behind (1) is that the same underlying symbol
distribution (i.e., $\mathbf{p}_k$ when class $k$ is true) produces, independently, both training and
test data. This is evident from $p_{k,i}^{\,x_{k,i}+y_i}$, where in the exponent training and test
data are combined.

2.2.2  Combined  Bayes  Test 

An important aspect of the CBT is that $\mathbf{p}_k$ and $\mathbf{p}_l$ are not treated
simply as unknown parameters to be estimated, which would make the resulting test a combined
generalized likelihood ratio test (CGLRT). (The CGLRT represents the correctly
posed generalized likelihood ratio procedure, which relies on ML probability
estimates culled from both training and test data; see formula (41) in Appendix
A.) Instead, the approach here is to give them prior distributions. We assume nothing is
known a priori about the probability vectors, and so we use an “ignorance” prior.
One version of prior ignorance is provided by the uniform Dirichlet given by [34]
(also see [53])

$$ f(\mathbf{p}_k) = (M-1)!\; \mathcal{I}_{\left\{\sum_{i=1}^{M} p_{k,i} = 1\right\}} \tag{2} $$

where $\mathcal{I}_{\{\chi\}}$ is the indicator function.

In the uniform Dirichlet distribution the symbol probabilities are uniformly
distributed on the unit hyperplane, and it is obtained when all $l_i$ (all $l_i$ must be
greater than zero) of the distribution

$$ f(p_{k,1}, \ldots, p_{k,M};\, l_1, \ldots, l_M) = \frac{\Gamma\!\left(\sum_{i=1}^{M} l_i\right)}{\prod_{i=1}^{M} \Gamma(l_i)}\; p_{k,1}^{\,l_1-1}\, p_{k,2}^{\,l_2-1} \cdots p_{k,M}^{\,l_M-1} \tag{3} $$

are set to unity [5, 51]. Appendix B discusses the uniform Dirichlet further, and
in particular, its marginal and conditional distributions are shown in formula
(48).
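
As an illustration of what a uniform Dirichlet prior looks like in practice, the following Python sketch draws symbol-probability vectors from it. It relies on the standard fact that independent unit-rate exponential variates, normalized by their sum, are uniformly distributed on the probability simplex. The function name `sample_uniform_dirichlet` is hypothetical and is not part of the original development.

```python
import random

def sample_uniform_dirichlet(M, rng=random):
    """Draw one length-M probability vector from the uniform Dirichlet."""
    # Normalized Exp(1) variates are Dirichlet(1, ..., 1) distributed,
    # i.e. uniform over all vectors with non-negative entries summing to one.
    g = [rng.expovariate(1.0) for _ in range(M)]
    total = sum(g)
    return [v / total for v in g]

if __name__ == "__main__":
    random.seed(0)
    print(sample_uniform_dirichlet(5))
```

Each call returns one candidate pair member such as $\mathbf{p}_k$; drawing two vectors independently gives one realization of the two class distributions assumed by the model.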

Now, using the uniform Dirichlet, the CBT appears as

$$ \frac{f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)}{f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_l)} = \frac{(N_k + M - 1)!\,(N_l + N_y + M - 1)!}{(N_k + N_y + M - 1)!\,(N_l + M - 1)!} \prod_{i=1}^{M} \frac{(x_{k,i} + y_i)!\,(x_{l,i})!}{(x_{k,i})!\,(x_{l,i} + y_i)!} \;\underset{H_l}{\overset{H_k}{\gtrless}}\; \tau \tag{4} $$

where the decision threshold $\tau$ is equal to $P(H_l)/P(H_k)$ for minimizing the
probability of error. Tests like the CBT shown above, which involve a ratio of
posterior distributions, are sometimes referred to in the statistics literature as
Bayes factors [11, 38, 55].
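
The test statistic can be evaluated numerically without overflowing the factorials by working with log-factorials. The sketch below is a minimal Python illustration, assuming the statistic has the factorial-ratio form that follows from integrating the combined multinomial model (1) against the uniform Dirichlet prior (2) (the form is restated in the docstring). The function names and the "k"/"l" return labels are hypothetical.

```python
from math import lgamma, log

def log_factorial(n):
    """log(n!) computed through the log-gamma function."""
    return lgamma(n + 1)

def cbt_log_ratio(xk, xl, y):
    """Natural log of the CBT statistic, assumed to be
    (Nk+M-1)! (Nl+Ny+M-1)! / [(Nk+Ny+M-1)! (Nl+M-1)!]
      * prod_i (xk_i+y_i)! xl_i! / [xk_i! (xl_i+y_i)!].
    xk, xl, y are per-symbol count lists of equal length M."""
    M = len(xk)
    Nk, Nl, Ny = sum(xk), sum(xl), sum(y)
    out = (log_factorial(Nk + M - 1) + log_factorial(Nl + Ny + M - 1)
           - log_factorial(Nk + Ny + M - 1) - log_factorial(Nl + M - 1))
    for a, b, c in zip(xk, xl, y):
        out += log_factorial(a + c) + log_factorial(b)
        out -= log_factorial(a) + log_factorial(b + c)
    return out

def cbt_decide(xk, xl, y, tau=1.0):
    """Return "k" when the statistic exceeds the threshold tau, else "l"."""
    return "k" if cbt_log_ratio(xk, xl, y) > log(tau) else "l"
```

For example, with training counts `[5, 3, 2]` and `[2, 3, 5]` and a single test observation of the first symbol, the statistic reduces to $(5+1)/(2+1) = 2$, so class $k$ is chosen when $\tau = 1$.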

The  complete  development  of  the  CBT  can  be  found  in  Appendix  B,  and  it 
is  included  for  completeness.  However,  the  CBT  can  also  be  determined  more 




straightforwardly  (after  correct  substitution  of  model  parameters,  and  a  slight 
reworking  of  the  result)  from  the  Multinomial-Dirichlet  distribution  shown  in  [5]. 
In  fact,  the  BDRA,  which  is  developed  later,  is  actually  based  on  a  conditional 
CBT  equivalent  to  the  Multinomial-Dirichlet.  With  this,  it  should  also  be  noted 
that  the  CBT  can  easily  be  extended  to  classify  more  than  two  classes,  and  this 
is  discussed  in  Chapter  3. 

2.2.3  Probability  of  Error 

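Letting $z_k = f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)$ (see formula (59) in Appendix B), the average
probability of error for the CBT is defined as

$$ P(e) = P(H_k)\, P(z_k \le \tau z_l \mid H_k) + P(H_l)\, P(z_k > \tau z_l \mid H_l). \tag{5} $$

It is necessary to show only the first term of (5), as the second term is similar
except for conditioning on $H_l$. Thus, ignoring $P(H_k)$, the first term of (5) is
given by

$$ P(z_k \le \tau z_l \mid H_k) = \sum_{\mathbf{y}} \sum_{\mathbf{x}_k} \sum_{\mathbf{x}_l} \mathcal{I}_{\{z_k \le \tau z_l\}}\, f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k) \tag{6} $$

where $f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)$ is defined in formula (59) of Appendix B.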

It is apparent from formulas (5) and (6) that computing the probability of
error involves summations over all possible configurations of training and test
data. Therefore, to reduce computational complexity this formula is worked
further. Notice that a vector $\mathbf{y}$ representing $N_y$ test samples contains $M_y$ of the $M$
possible discrete symbols (where $M_y \le M$). In other words, for a given sample
of test data either all $M$ discrete symbols are observed or some subset $M_y$ are
observed. Thus, the notation for training data is redefined as

$$ \mathbf{x}_k = (\mathbf{x}_{kr}, \mathbf{x}_{kn}) $$

with $\mathbf{x}_{kr}$ referring to those training data which are “represented” by the same $M_y$
discrete symbols as the test data, and $\mathbf{x}_{kn}$ to those which are not. Because of this,
it is important to note that the indicator function, $\mathcal{I}_{\{z_k \le \tau z_l\}}$, of (6) depends only
on $\mathbf{y}$, $\mathbf{x}_{kr}$, and $\mathbf{x}_{lr}$ (every symbol with $y_i = 0$ contributes a factor of one to the ratio in (4)).
Therefore, summations over $\mathbf{x}_{kn}$ and $\mathbf{x}_{ln}$ (these summations
are not required if $M_y = M$) can proceed first in formula (6) so that it becomes

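$$ P(z_k \le \tau z_l \mid H_k) = \sum_{\mathbf{y}} \sum_{\mathbf{x}_{kr}} \sum_{\mathbf{x}_{lr}} \mathcal{I}_{\{z_k \le \tau z_l\}} \sum_{\mathbf{x}_{kn}} \sum_{\mathbf{x}_{ln}} f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k) \tag{7} $$

where, using the formulas (57) and (58) of Appendix B, the summations over $\mathbf{x}_{kn}$
and $\mathbf{x}_{ln}$ produce

$$ \begin{aligned}
\sum_{\mathbf{x}_{kn}} \sum_{\mathbf{x}_{ln}} f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)
&= \sum_{\mathbf{x}_{kn}} \sum_{\mathbf{x}_{ln}} \frac{[(M-1)!]^2\, N_k!\, N_l!\, N_y!}{(N_k + N_y + M - 1)!\,(N_l + M - 1)!} \prod_{i=1}^{M} \binom{x_{k,i} + y_i}{y_i} \\
&= \frac{[(M-1)!]^2\, N_k!\, N_l!\, N_y!}{(N_k + N_y + M - 1)!\,(N_l + M - 1)!} \prod_{i \in i_r} \binom{x_{k,i} + y_i}{y_i} \\
&\quad \times \binom{N_k - \sum \mathbf{x}_{kr} + M - M_y - 1}{N_k - \sum \mathbf{x}_{kr}} \binom{N_l - \sum \mathbf{x}_{lr} + M - M_y - 1}{N_l - \sum \mathbf{x}_{lr}}.
\end{aligned} \tag{8} $$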




With respect to notation, $\sum \mathbf{x}_{kr}$ above means the sum of all training data under
$k$ that are represented by the same $M_y$ discrete symbols as the test data. Also, the
term “$i_r$” in formula (8) refers to that subset of symbols $\{1, 2, \ldots, M\}$ contained
in the test data. With this, observe that

$$ \binom{N_k - \sum \mathbf{x}_{kr} + M - M_y - 1}{N_k - \sum \mathbf{x}_{kr}} $$

means the number of ways $N_k - \sum \mathbf{x}_{kr}$ training data can be arranged amongst $M - M_y$
discrete symbols. Further, it is important to note that when $M = M_y$ both
factors of this form are dropped from formula (8).


In addition to that shown above, the computational complexity associated
with computing P(e) in formula (5) can be further reduced by exploiting
redundancy in the test data. This has the effect of reducing the required number
of terms to be summed over. The method is based on the fact that
for a given number $N_y$ of test observations, there are a finite number of unique
$\mathbf{y}$'s (each vector taking on a varying number, $M_y$, of discrete symbols) that are
possible. All of the remaining $\mathbf{y}$'s have elements that are redundant orderings of
the original set, and these orderings occur in a countable number of ways. For
example, an important simplification used throughout the results obtained in this
work is when $N_y = 1$. In this case, formula (8) becomes


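$$ \frac{[(M-1)!]^2\, N_k!\, N_l!}{(N_k + M)!\,(N_l + M - 1)!}\, \{x_{k,i_r} + 1\} \binom{N_k - \sum \mathbf{x}_{kr} + M - 2}{N_k - \sum \mathbf{x}_{kr}} \binom{N_l - \sum \mathbf{x}_{lr} + M - 2}{N_l - \sum \mathbf{x}_{lr}}. \tag{9} $$

Note, because of symmetry in the Dirichlet distribution, formula (9) is equal for
all “$i_r$” in $\{1, 2, \ldots, M\}$. This implies that when $N_y = 1$ the summation over $\mathbf{y}$ in
formula (7) involves a sum of the same $M$ terms. Also, the summations over $\mathbf{x}_{kr}$
and $\mathbf{x}_{lr}$ in formula (7) are then defined as, respectively, $\sum_{x_{k,i_r}=0}^{N_k}$ and $\sum_{x_{l,i_r}=0}^{N_l}$.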
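
For the single-observation case the whole computation collapses to a double sum over the two counts of the observed symbol, which is small enough to evaluate exactly. The following Python sketch does this with rational arithmetic under stated assumptions: the weight of each configuration is the $N_y = 1$ form described above (a constant, the factor $x_{k,i_r} + 1$, and two arrangement counts), the statistic for $N_y = 1$ is $(N_l + M)(x_{k,i_r} + 1) / [(N_k + M)(x_{l,i_r} + 1)]$, and ties are counted as a decision for class $l$. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def arrangements(n, bins):
    """Ways to spread n training counts over `bins` unobserved symbols.
    With no unobserved symbols (M = My) the factor is dropped, which
    forces n == 0."""
    if bins == 0:
        return 1 if n == 0 else 0
    return comb(n + bins - 1, n)

def weight(a, b, Nk, Nl, M):
    """Mass of (x_{k,ir} = a, x_{l,ir} = b) for one test symbol drawn
    from the class whose count is `a` (the Ny = 1 case)."""
    const = Fraction(factorial(M - 1) ** 2 * factorial(Nk) * factorial(Nl),
                     factorial(Nk + M) * factorial(Nl + M - 1))
    return (const * (a + 1)
            * arrangements(Nk - a, M - 1) * arrangements(Nl - b, M - 1))

def average_error(Nk, Nl, M, prior_k=Fraction(1, 2)):
    """Average probability of error for one test observation."""
    prior_l = 1 - prior_k
    tau = prior_l / prior_k
    err = Fraction(0)
    for a in range(Nk + 1):
        for b in range(Nl + 1):
            ratio = Fraction((Nl + M) * (a + 1), (Nk + M) * (b + 1))
            if ratio <= tau:
                # class l is decided, an error when the test datum is from k
                err += prior_k * M * weight(a, b, Nk, Nl, M)
            else:
                # class k is decided, an error when the test datum is from l
                err += prior_l * M * weight(b, a, Nl, Nk, M)
    return err

if __name__ == "__main__":
    for M in range(1, 11):
        print(M, float(average_error(10, 10, M)))
```

Scanning `M` as in the usage lines locates the minimizing symbol quantity for a given pair of training sizes; the multiplication by `M` reflects the $M$ identical terms in the sum over the test symbol.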


2.3  Results 


Figure 3: P(e) for various numbers of discrete symbols M.


Figure 3 is a typical plot of the results presented here, and unless otherwise
indicated, in all figures contained in this dissertation the threshold $\tau = 1$, meaning
that $P(H_k)$ and $P(H_l)$ are both 0.5. Also, performance results for the CBT
using simulated data are shown in Appendix A. Figure 3 shows the average
probability of error for the CBT, with varying M, using the formula shown in
(5). In this example, $N_{\text{class 1}} = N_{\text{class 2}} = 10$, and $N_y = 2$. Notice in Figure 3
that P(e) starts out decreasing with increasing M and is minimum at a point
called M* (see [34]), and in this case M* = 5. Further, it can be seen that for
M greater than M*, P(e) steadily increases. Observe that P(e) is upper bounded by
0.5 at $M = 1$ and at $M = \infty$, where it is determined by the priors, $P(H_k)$ and
$P(H_l)$ (that is, at these points the training data cannot be relied on [23]). In
[61], the dependence of P(e) on M has been attributed to incomparable models.
That is, each value of M represents a different prior for the specified numbers of
training and test data, and as such performance cannot be compared. However,
the interest here is in problems where no prior knowledge exists about either the
discrete symbol probabilities or the “correct” quantization complexity
(M). Thus, within this context, the result shown above implies that on average,
and with ten samples of training data for each class and two test observations,
best classification performance (i.e., minimum P(e)) occurs when five discrete
symbols are used. In other words, if more than five symbols are used, then the
probability of a test symbol being unrepresented in the training data increases,
and this causes more classification errors. On the other hand, if not enough




discrete  symbols  are  used,  then  the  true  probabilistic  structure  of  each  class  may 
not  be  adequately  represented,  and  this  also  increases  P(e).  When  faced  with 
uncertainty,  such  information  is  useful  as  a  guide  in  selecting  the  most  favorable 
quantization  complexity  for  particular  training  and  test  data  sizes. 


Figure 4: P(e) for the CBT and the SGLRT.


Figure 4 illustrates the performance gains of including test data to infer
symbol probabilities by comparing P(e) for the CBT to that of an uncombined
generalized likelihood ratio type test (referred to as the standard GLRT, or SGLRT).




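The SGLRT, being uncombined, does not incorporate test data to infer
symbol probabilities, and it is given by (performance results for the SGLRT using
simulated data appear in Appendix A)

$$ \frac{\prod_{i=1}^{M} \hat{p}_{k,i}^{\,y_i}}{\prod_{i=1}^{M} \hat{p}_{l,i}^{\,y_i}} \;\underset{H_l}{\overset{H_k}{\gtrless}}\; \tau \tag{10} $$

where, using maximum likelihood estimates,

$$ \hat{p}_{k,i} = \frac{x_{k,i}}{N_k} \quad \text{and} \quad \hat{p}_{l,i} = \frac{x_{l,i}}{N_l}. \tag{11} $$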

In the situation of Figure 4 the training data of each class are fixed to ten
outcomes, and four different test data sizes are used. It can be seen in this figure
that when $N_y = 1$ both tests perform identically (this was previously pointed out by
Nadas [52]), and M* = 4. However, when $N_y$ is increased the situation changes
quite rapidly. The CBT consistently shows an overall relative decrease in P(e) for
a given M, while the SGLRT performs worse except for small values of M. With
this, when $N_y$ is increased to four, M* increases to six for the CBT, whereas for
the SGLRT it decreases to three.

A few additional comments are included to supplement these results. Notice that
because the CBT uses both training and test data to infer symbol probabilities,
the additional information provided by the test data causes a relative decrease in
P(e), and an accompanying increase in M*. Also, intuitively, the probability that a
test datum takes on a discrete symbol unrepresented by training data increases
with $N_y$ and $M$, which accounts for the SGLRT's poor performance, as both
numerator and denominator of the test shown in (10) will then have an increasing
probability of being zero valued. Further, in Appendix D it is shown that the
minimum P(e) obtainable with a Dirichlet distribution on the symbol
probabilities, and when $N_y = 1$, is equal to 0.25. But observe for the CBT in Figure 4
that additional test observations allow P(e) to be reduced below this value.
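
The zero-valued case is easy to reproduce. The short Python sketch below evaluates the numerator and denominator of an uncombined likelihood ratio separately, assuming maximum likelihood estimates of the form count divided by class total; the function name `sglrt_terms` is hypothetical, and both training sets are assumed non-empty.

```python
def sglrt_terms(xk, xl, y):
    """Numerator and denominator of the uncombined likelihood ratio,
    using ML estimates x / N for each class (Nk, Nl assumed > 0)."""
    Nk, Nl = sum(xk), sum(xl)
    num = den = 1.0
    for a, b, c in zip(xk, xl, y):
        num *= (a / Nk) ** c   # 0.0 ** 0 evaluates to 1.0 in Python
        den *= (b / Nl) ** c
    return num, den

if __name__ == "__main__":
    # The third symbol never occurs in either training set.
    print(sglrt_terms([4, 6, 0], [5, 5, 0], [0, 0, 1]))
```

When the test datum falls on a symbol absent from both training sets, both terms are zero and the ratio is undefined, whereas a test built on a Dirichlet prior still assigns that symbol a positive probability under each class.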


Figure  5:  M*  for  various  training  data  set  sizes. 

Figure 5 is used to illustrate the relationship between best discrete symbol
quantity and training data size. Shown for both the CBT and the SGLRT is a
plot of M* versus training data sizes of up to twenty outcomes ($N_{\text{class 1}} = N_{\text{class 2}}$).




Two  different  numbers  of  test  observations  appear  for  both  tests,  and  they  are 
Ny  =  1  and  Ny  =  4.  In  this  figure,  it  is  apparent  for  both  tests  that  when 
the  number  of  training  data  is  increased  a  larger  number  of  discrete  symbols  are 
required  for  best  classification  performance  (again,  both  tests  perform  the  same 
when  Ny  =  1).  But,  for  the  CBT  the  rate  of  increase  in  the  required  number  of 
symbols  is  faster  when  Ny  =  4  than  it  is  when  Ny  =  1,  while,  for  the  SGLRT 
the  opposite  is  true  (recall  Figure  4  where  performance  of  the  SGLRT  diminishes 
with  increasing  Ny). 


Figure  6:  Comparison  of  P(e)  using  different  Dirichlet  priors. 




Before proceeding to the next section, notice that in previous work [42] it was
found that in relation to universal encoding a better noninformative prior to use,
given unknown true statistics, is the Dirichlet with all $l_i$ of formula (3) set to
one half. In the literature, this distribution is also known as Jeffreys prior for
the multinomial (see [10]). Figure 6 shows curves of the average
probability of error for the two cases of a uniform Dirichlet (i.e., Dirichlet(1),
which is replotted from Figure 4), and a Dirichlet with its parameters set to
one half (i.e., Dirichlet(1/2)). Both curves in Figure 6 are based on
ten samples of training data for each class and one test observation.
It can be seen in this figure that results based on the Dirichlet(1/2) are better
than those based on the uniform Dirichlet. In fact, the value of M* when using the
Dirichlet(1/2) is one less than it is when using the uniform Dirichlet, and this
indicates that less feature information is required for best performance. However,
although the Dirichlet(1/2) shows better overall average performance, it does
not treat each symbol probability equally (actually, the Dirichlet(1/2) puts more
weight on probabilities near zero and one; see [11]), so the uniform Dirichlet
is used here to better represent complete ignorance about the underlying symbol
probabilities of each class. Also, use of the uniform Dirichlet for this application
is consistent with previous work (e.g., [34, 61]), and as will be seen in Chapter 3
it is more effective than the Dirichlet(1/2) at data reduction.
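
For a single test observation, the effect of the prior parameter can be seen directly in the predictive probability of a symbol. Under a symmetric Dirichlet with every parameter equal to $\alpha$, the standard Dirichlet-multinomial result gives $(x_{k,i} + \alpha)/(N_k + M\alpha)$, so $\alpha = 1$ corresponds to the uniform Dirichlet and $\alpha = 1/2$ to the Dirichlet(1/2). The Python sketch below is only an illustration of that textbook formula; the function name `predictive` and the parameter name `alpha` are not from the original.

```python
from fractions import Fraction

def predictive(xk, i, alpha):
    """P(next symbol is i | training counts xk) under a symmetric
    Dirichlet(alpha) prior on the symbol probabilities."""
    M, Nk = len(xk), sum(xk)
    return (xk[i] + alpha) / (Nk + M * alpha)

if __name__ == "__main__":
    counts = [3, 0, 7]
    for alpha in (Fraction(1), Fraction(1, 2)):
        print(alpha, [predictive(counts, i, alpha) for i in range(3)])
```

The smaller parameter gives less probability to the symbol that was never observed in training, which is one way to see that it leans toward probabilities near zero and one.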




2.4  Testing  the  Statistical  Similarity  of  Discrete  Data 

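In addition to using the CBT of (4) for classifying an unknown test data
vector, it can also be used to test if two samples of discrete data are produced
by the same multinomial distribution. Analogously, this is the same as testing
the statistical similarity of two histograms. Based on (1), the joint distributions
for the number of occurrences of all symbols for both data sets given they are
produced by the same probability vector $\mathbf{p}_k$ (i.e., $H_1: \mathbf{p}_k = \mathbf{p}_l$), and given they
are produced by independent probability vectors $\mathbf{p}_k$ and $\mathbf{p}_l$ (i.e., $H_0: \mathbf{p}_k \neq \mathbf{p}_l$)
are given by, respectively,

$$ f(\mathbf{x}_k, \mathbf{x}_l \mid \mathbf{p}_k, \mathbf{p}_l, H_1) = N_k!\, N_l! \prod_{i=1}^{M} \frac{p_{k,i}^{\,x_{k,i} + x_{l,i}}}{x_{k,i}!\; x_{l,i}!} \tag{12} $$

and

$$ f(\mathbf{x}_k, \mathbf{x}_l \mid \mathbf{p}_k, \mathbf{p}_l, H_0) = N_k!\, N_l! \prod_{i=1}^{M} \frac{p_{k,i}^{\,x_{k,i}}\; p_{l,i}^{\,x_{l,i}}}{x_{k,i}!\; x_{l,i}!} \tag{13} $$

where, using notation similar to (1),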


$k, l \in \{\text{sample 1}, \text{sample 2}\}$, and $k \neq l$;

$M$ is the number of discrete symbols (or histogram bins);

$x_{k,i}$ is the number of occurrences of the $i$th symbol for sample $k$;

$N_k$ $\{N_k = \sum_{i=1}^{M} x_{k,i}\}$ is the total number of occurrences of the $M$ symbols for sample $k$;

$p_{k,i}$ $\{\sum_{i=1}^{M} p_{k,i} = 1\}$ is the probability of the $i$th symbol.




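A CBT for this application can be developed using the same procedure shown
in Appendix B. However, it can also be obtained directly from formula (4). This
is accomplished by first eliminating any training data under class $l$, followed by
renaming the $y_i$ as $x_{l,i}$, and then redefining the hypotheses $H_1$ and $H_0$ by those
shown above. In either case, the test for this situation appears as

$$ \frac{f(\mathbf{x}_k, \mathbf{x}_l \mid H_1)}{f(\mathbf{x}_k, \mathbf{x}_l \mid H_0)} = \frac{(N_k + M - 1)!\,(N_l + M - 1)!}{(M - 1)!\,(N_k + N_l + M - 1)!} \prod_{i=1}^{M} \frac{(x_{k,i} + x_{l,i})!}{x_{k,i}!\; x_{l,i}!} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau. \tag{14} $$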
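
A minimal Python sketch of this similarity test follows. It assumes the statistic is the Bayes-factor form obtained from formula (4) by the substitution just described (no class $l$ training data, and the test counts playing the role of the second sample); that assumed form is restated in the docstring. The function names are hypothetical.

```python
from math import lgamma, log

def _lf(n):
    """log(n!) through the log-gamma function."""
    return lgamma(n + 1)

def similarity_log_ratio(xk, xl):
    """Natural log of the assumed similarity statistic
    (Nk+M-1)! (Nl+M-1)! / [(M-1)! (Nk+Nl+M-1)!]
      * prod_i (xk_i + xl_i)! / (xk_i! xl_i!)
    for two histograms xk, xl over the same M bins."""
    M = len(xk)
    Nk, Nl = sum(xk), sum(xl)
    out = _lf(Nk + M - 1) + _lf(Nl + M - 1) - _lf(M - 1) - _lf(Nk + Nl + M - 1)
    for a, b in zip(xk, xl):
        out += _lf(a + b) - _lf(a) - _lf(b)
    return out

def same_distribution(xk, xl, tau=1.0):
    """True when the statistic exceeds tau (the samples are judged similar)."""
    return similarity_log_ratio(xk, xl) > log(tau)
```

As a small check, two single-observation samples over two bins give a ratio of 4/3 when they land in the same bin and 2/3 when they land in different bins, so only the first pair is judged similar at $\tau = 1$.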


Figure  7:  The  minimum  analytical  P(e)  for  the  CBT  and  the  KST. 

In Figure 7, the CBT of formula (14) is compared to the Kolmogorov-Smirnov
Test (KST) [15]. Using the notation of (12) through (14), the KST is defined as

$$ \frac{1}{\sup_{1 \le \omega \le M} \left| F_{N_k}(\omega) - F_{N_l}(\omega) \right|} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau \tag{15} $$

where $F_{N_k}(\omega)$ represents the cumulative distribution function.
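
The quantity at the heart of a two-sample Kolmogorov-Smirnov comparison is the largest gap between the two empirical cumulative distributions. For histogram data this can be sketched in a few lines of Python, assuming the $M$ symbols are taken in their index order; the function name `ks_statistic` is hypothetical, and how the gap is scaled or thresholded in the test above is not reproduced here.

```python
from itertools import accumulate

def ks_statistic(xk, xl):
    """Largest absolute gap between the empirical CDFs of two
    histograms over the same ordered bins (both assumed non-empty)."""
    Nk, Nl = sum(xk), sum(xl)
    Fk = [c / Nk for c in accumulate(xk)]
    Fl = [c / Nl for c in accumulate(xl)]
    return max(abs(a - b) for a, b in zip(Fk, Fl))

if __name__ == "__main__":
    print(ks_statistic([2, 3, 5], [5, 3, 2]))
```

Unlike a test built on symbol counts alone, this statistic depends on the order in which the symbols are arranged, since reordering the bins changes the cumulative sums.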

Note that a Chi-Square Goodness of Fit Test [33] can also be used
in this application. However, like the CGLRT it is in general only asymptotically
optimal, and it must be modified if a discrete symbol is unobserved in the data.

The advantage of using the test of (14) in place of the test of (15) is clearly
demonstrated in Figure 7. In this figure, the minimum average probability of error
in testing if two sets of discrete data belong to the same multinomial distribution
is plotted for both tests versus the number of symbols (M). Results are
shown for symbol quantities from two to five, and each population
contains ten samples of data. Also, by minimum average probability of error it
is meant that for each discrete symbol quantity the least P(e) over all possible
thresholds was chosen and plotted in Figure 7. Given this, it can be seen that both
tests start out with the same P(e) when M = 2, but by M = 5, P(e) is steadily
decreasing for the CBT, while it is increasing for the KST. With this, observe in
Figure 7 that M* occurs at three discrete symbols for the KST, and for the CBT
M* appears to be much larger. In this case, due to computational complexity
M* was not determined theoretically for the CBT; however, empirically M* was
estimated to be approximately 12. In general, it can be seen that the superior
performance of the CBT allows for more precise threshold settings when testing at
a specified probability of false alarm. But, similar to the GLRT for classification,


performance  of  the  KST  will  asymptotically  approach  that  of  the  CBT  as  the 
sample  sizes  become  large  (see  Appendix  A). 

2.5  Summary 

In  this  chapter,  it  has  been  demonstrated  that  given  only  the  training  and 
test  data  set  sizes,  and  without  any  a  priori  knowledge  of  the  underlying  symbol 
probabilities  of  each  class,  it  is  possible  to  determine  the  discrete  symbol  quantity 
which  minimizes  the  average  probability  of  error.  Specifically,  the  number  of 
discrete  symbols  achieving  this  minimum  point  was  called  M*.  Further,  it  was 
shown that M* increases with the training data size for both the CBT and the
SGLRT. However, only for the CBT does M* increase faster, and the overall P(e)
become lower, as the number of test data grows. Additionally, the CBT was shown to
achieve a lower minimum average probability of error than the KST (except when
M = 2 where they are equal) for testing if two discrete data sets are statistically
similar.




Chapter  3 


The  Bayesian  Data  Reduction  Algorithm 


3.1  Introduction 

The focus of this chapter is on developing the Bayesian Data Reduction
Algorithm (BDRA), which is based on the CBT of Chapter 2. The BDRA uses
a “greedy” approach for reducing features from the training data of each class,
and it relies on the conditional probability of error for the CBT (formula (5) of
Chapter 2 conditioned on the training data) as a metric for making data reducing
decisions. Performance of the algorithm is compared to a neural network at
classifying discrete feature vectors which contain binary and ternary valued features.
In this comparison, it is shown that the BDRA is superior to the neural network
at improving overall classification performance by reducing, or eliminating,
irrelevant feature information. However, for a fixed amount of training data, the
performance of both schemes degrades as the quantization complexity of each
feature is increased from binary to ternary values.




3.2  Development  of  the  BDRA 


A  fundamental  component  of  the  BDRA  is  the  conditional  probability  of  error 
formula  for  the  CBT.  But,  before  developing  this  formula,  and  its  associated 
decision  rule,  for  convenience  the  notation  used  here  is  itemized  first  (see  formula 
(1)  of  Chapter  2): 

• $C$ is the total number of classes with $k \in \{1, \ldots, C\}$.

• $M$ is the number of discrete symbols.

• $H_k$ is the hypothesis defined as $\mathbf{p}_y = \mathbf{p}_k$.

• $X = (\mathbf{x}_1, \ldots, \mathbf{x}_C)$ is the collection of training data from all $C$ classes.

• $x_{k,i}$ is the number of occurrences of the $i$th symbol in the training data for
class $k$.

• $N_k$ $\{N_k = \sum_{i=1}^{M} x_{k,i}\}$ is the total number of training data for class $k$.

• $y_i$ is the number of occurrences of the $i$th symbol in the test data.

• $N_y$ $\{N_y = \sum_{i=1}^{M} y_i\}$ is the total number of test data.

• $p_{k,i}$ $\{\sum_{i=1}^{M} p_{k,i} = 1\}$ is the probability of the $i$th symbol for class $k$.

• $\mathcal{I}_{\{\chi\}}$ is the indicator function.

The  conditional  probability  of  error  for  the  CBT  depends  on  a  decision  rule 
that  decides  if  an  unknown  test  vector,  y,  belongs  to  a  class  k  given  knowledge  of 




the training data. Thus, the distribution $f(\mathbf{y} \mid X, H_k)$ must be found. However,
based on the assumption that the training data of each class is independent of
the other training data sets (e.g., $\mathbf{x}_l$ is independent of $H_k$), this distribution is
equivalent to $f(\mathbf{y} \mid \mathbf{x}_k, H_k)$. Therefore, with equiprobable classes, the decision
rule for a conditional CBT is then given by (a similar rule without specifying
distributions was shown in [46])

$$ \max_{k}\; f(\mathbf{y} \mid \mathbf{x}_k, H_k) \tag{16} $$

where ties are broken arbitrarily.

Now, it is straightforward to see that with independent training data sets
for each class the distribution $f(\mathbf{y} \mid \mathbf{x}_k, H_k)$ can be obtained from the ratio of
$f(\mathbf{y}, \mathbf{x}_k \mid H_k)$ and $f(\mathbf{x}_k)$, where the latter distributions appear respectively in
Appendix B as formulas (57) and (58) (defined for $H_k$ instead of $H_l$). Thus, the
conditional distribution of formula (16) can be written as

$$ f(\mathbf{y} \mid \mathbf{x}_k, H_k) = \frac{N_y!\,(N_k + M - 1)!}{(N_k + N_y + M - 1)!} \prod_{i=1}^{M} \frac{(x_{k,i} + y_i)!}{(x_{k,i})!\,(y_i)!}. \tag{17} $$

Note, for completeness, formula (17) is also developed from probabilistic
considerations in Appendix C. Further, a consequence of independent training data sets
amongst the classes is that formulas (16) and (17) represent the extension of the
CBT of formula (4) in Chapter 2 to $C$ classes.

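Given the results above, and letting $z_k = f(\mathbf{y} \mid \mathbf{x}_k, H_k)$, the associated
conditional probability of error formula for the test of (16) is given by

$$ P(e \mid X) = \sum_{k=1}^{C} \sum_{\mathbf{y}} P(H_k)\; \mathcal{I}_{\{z_k \le z_l,\; l \neq k\}}\; f(\mathbf{y} \mid \mathbf{x}_k, H_k). \tag{18} $$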

The results in this chapter are based on one observation of test data; therefore,
with $N_y = 1$ formula (17) becomes

$$ f(y_i = 1 \mid \mathbf{x}_k, H_k) = \frac{x_{k,i} + 1}{N_k + M} \tag{19} $$

where $i \in \{1, \ldots, M\}$.
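
The conditional test is straightforward to sketch in Python. The code below assumes that the conditional distribution has the Dirichlet-multinomial form $N_y!\,(N_k+M-1)!/(N_k+N_y+M-1)! \prod_i (x_{k,i}+y_i)!/(x_{k,i}!\,y_i!)$, which reduces to $(x_{k,i}+1)/(N_k+M)$ for a single test observation, and that ties go to the lowest class index (one arbitrary choice). The function names are hypothetical.

```python
from math import exp, lgamma, log

def _lf(n):
    """log(n!) through the log-gamma function."""
    return lgamma(n + 1)

def log_predictive(xk, y):
    """log f(y | x_k, H_k) under the assumed Dirichlet-multinomial form."""
    M, Nk, Ny = len(xk), sum(xk), sum(y)
    out = _lf(Ny) + _lf(Nk + M - 1) - _lf(Nk + Ny + M - 1)
    for a, c in zip(xk, y):
        out += _lf(a + c) - _lf(a) - _lf(c)
    return out

def predictive(xk, y):
    """f(y | x_k, H_k) as a plain probability."""
    return exp(log_predictive(xk, y))

def classify(X, y):
    """Index of the class with the largest f(y | x_k, H_k); equal
    priors assumed, ties resolved toward the lowest index."""
    scores = [log_predictive(xk, y) for xk in X]
    return max(range(len(X)), key=scores.__getitem__)
```

For instance, `predictive([3, 0, 7], [0, 1, 0])` evaluates to $1/13$, matching the single-observation case, and `classify` works unchanged for any number of classes.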

Notice that results appear only for the case of $N_y = 1$ because this is sufficient to
illustrate all key aspects of the BDRA's performance. However, for classification
situations in which $N_y > 1$ it is expected that the BDRA will be applied with
the appropriate number of test observations. Then, and although this has not
been fully studied here, consistent with Figure 4 of Chapter 2 the additional
test observations should cause a relatively lower $P(e \mid X)$ (and a larger final
quantization complexity) as compared to when $N_y = 1$. Otherwise, for any
$N_y$ the general relative performance characteristics of the BDRA should remain
consistent with those shown below.

Now,  with  P(e  |  X)  defined  in  formula  (18),  the  following  iterative  steps  are 
used  in  implementing  the  BDRA. 

1. Using the initial training data with quantization complexity $M$ (e.g., in the
case of all binary valued features $M = 2^{N_f}$, where $N_f$ is the number of
features), formula (18) is used to compute $P(e \mid X; M)$.

2. Beginning with the first feature (selection is arbitrary), reduce this feature
for each class by summing (i.e., merging) the numbers of occurrences of
those quantized symbols that correspond to joining adjacent discrete levels
of that feature (e.g., with binary features, for all classes merge those
quantized symbols containing a binary zero for that feature with those containing
a binary one).

3. Use the newly merged training data (referred to as $X'$) and the new
quantization complexity (e.g., $M' = 2^{N_f - 1}$ in the binary feature case), and
use formula (18) to compute $P(e \mid X'; M')$.

4. Repeat items two and three for all $N_f$ features.

5. From item four select the minimum of all computed $P(e \mid X'; M')$ (in the
event of a tie use an arbitrary selection), and choose this as the new training
data configuration for each class (this corresponds to permanently reducing,
or removing, the associated feature).

6. Repeat items two through five until the probability of error does not
decrease any further, or until $M' = 2$, at which point the final quantization
complexity has been found.
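
The six steps above can be sketched in Python for the all-binary case, where merging the two levels of a feature amounts to removing it. This is an illustrative sketch under stated assumptions, not the author's implementation: one test observation ($N_y = 1$), equal class priors, ties broken arbitrarily (so the error is computed as one minus the probability of a correct decision), and training data stored as dictionaries mapping feature tuples to counts. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

def error_given_training(counts, n_features):
    """P(e | X) for Ny = 1 with equal priors and binary features.
    Each class predicts symbol i with probability (x_{k,i} + 1) / (N_k + M)."""
    M = 2 ** n_features
    totals = [sum(c.values()) for c in counts]
    p_correct = 0.0
    for symbol in product((0, 1), repeat=n_features):
        z = [(c.get(symbol, 0) + 1) / (n + M) for c, n in zip(counts, totals)]
        p_correct += max(z) / len(counts)
    return 1.0 - p_correct

def merge_feature(counts, j):
    """Step 2: sum the counts of symbols that differ only in feature j."""
    merged = []
    for c in counts:
        m = Counter()
        for symbol, x in c.items():
            m[symbol[:j] + symbol[j + 1:]] += x
        merged.append(m)
    return merged

def bdra_binary(counts, n_features):
    """Greedy backward reduction; returns (kept feature indices, P(e | X), data)."""
    kept = list(range(n_features))
    best = error_given_training(counts, len(kept))       # step 1
    while len(kept) > 1:                                  # stop once M' = 2
        trials = []
        for pos in range(len(kept)):                      # steps 2 to 4
            cand = merge_feature(counts, pos)
            err = error_given_training(cand, len(kept) - 1)
            trials.append((err, pos, cand))
        err, pos, cand = min(trials, key=lambda t: t[:2])  # step 5
        if err >= best:                                   # step 6: no decrease
            break
        best, counts = err, cand
        del kept[pos]
    return kept, best, counts

if __name__ == "__main__":
    # Feature 0 separates the classes; feature 1 carries no class information.
    class_a = {(0, 0): 5, (0, 1): 5}
    class_b = {(1, 0): 5, (1, 1): 5}
    print(bdra_binary([class_a, class_b], 2)[:2])
```

In the usage lines, removing the uninformative feature lowers the conditional error from $1/7$ to $1/12$, so the sketch keeps only feature 0. Features with more than two levels would need a merge step that joins adjacent levels rather than dropping the feature outright.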

A  few  additional  notes  about  the  BDRA  are  necessary.  First,  the  BDRA  is 
“greedy”  in  that  it  chooses  a  best  training  data  configuration  at  each  iteration 
(see  step  five  above)  in  the  process  of  determining  a  best  quantization  complexity. 


In fact, the six steps of the BDRA shown above are analogous to what is known as
a backward sequential search algorithm [37, 40]. Also, a slightly better approach
than the six steps shown above is to do a global search over all possible merges
and corresponding training data configurations. However, a simulation study
involving hundreds of independent trials revealed that the “greedy” approach
shown here produced results different from those of a global approach only about
three percent of the time. Additionally, the overall average probability of error
for the two approaches differed by only an insignificant amount.

With  this,  it  should  also  be  noted  that  data  reduction  in  the  BDRA  implies 
a  change  in  Dirichlet  prior  (see  formula  (2)  of  Chapter  2)  with  each  merging  of 
training data (i.e., due to a reduction in M). But, it is argued here that changing
the Dirichlet prior by the BDRA is justified because it essentially removes
irrelevant  feature  information  with  each  merge.  In  other  words,  the  BDRA  looks 
for  that  quantization  complexity,  M,  which  makes  the  most  sense  based  on  the 
training  data. 

Before  presenting  performance  results  of  the  BDRA  observe  that  as  a  check 
on  a  Bayesian  approach  to  this  problem  other  data  reduction  algorithms  were 
also  developed  based  on  the  SGLRT  of  formula  (10),  Chapter  2,  and  the  CGLRT 
of  formula  (41)  in  Appendix  A.  Development  of  these  algorithms  consisted  of 
substituting the appropriate distributional formula in the $z_k$ of formula (18). That
is, in the case of the SGLRT $z_k = f(\mathbf{y} \mid \mathbf{p}_k, H_k)$, and in the case of the CGLRT
$z_k = f(\mathbf{x}_k, \mathbf{y} \mid \mathbf{p}_k, H_k) / f(\mathbf{x}_k \mid \mathbf{p}_k, H_k)$. But, in implementation it was found that
neither of these GLR methods would work as well as the CBT in data reduction.
In  fact,  in  all  of  the  cases  examined  (i.e.,  situations  of  the  type  shown  in  the 
results  below)  an  SGLRT  based  data  reduction  method  could  not  eliminate  a 
single  irrelevant  feature.  On  the  other  hand,  for  the  same  training  data  sets  the 
BDRA  was  able  to  effectively  reduce  the  data  as  shown  below. 

3.3  Results 

Performance  results  of  the  BDRA  appear  in  the  figures  found  on  the  following 
pages.  In  all  cases,  one  test  observation  is  used  ( Ny  =  1),  and  there  are  two 
classes  (i.e.,  C  =  2).  Note,  the  results  presented  in  each  figure  are  averaged  over 
one  hundred  independent  trials  of  randomly  generating  symbol  probabilities  and 
associated  training  data.  With  this,  the  training  data  sets  of  each  class  contain 
either  six  binary  valued  features  or  six  ternary  valued  features  so  that  the  initial 
unreduced  quantization  complexity,  M,  either  equals  64  or  729,  respectively. 

The  following  items  define  the  notation  used  for  the  error  probabilities  shown 
in  the  figures  below: 

Unmerged  (Training  Data)  The  probability  of  error  computed  using  formula 
(18)  and  the  initial  training  data  configuration  of  each  class  before  data 
reduction. 


Merged  (Training  Data)  The  probability  of  error  computed  using  formula 
(18)  and  the  final  reduced  training  data  configuration  for  each  class  (i.e., 
after  applying  the  BDRA). 

Optimal  The  probability  of  error  computed  when  the  true  underlying  symbol 
probabilities  are  known. 

Unmerged (True) The probability of error computed using formula (18) with
the $z_k$ based on the initial unmerged training data (and formula (19)), and
$f(\mathbf{y} \mid \mathbf{x}_k, H_k)$ replaced by the true symbol probabilities.

Merged (True) The probability of error computed using formula (18) with the
$z_k$ based on the merged training data (and formula (19) after applying the
BDRA), and $f(\mathbf{y} \mid \mathbf{x}_k, H_k)$ replaced by the true symbol probabilities.

Neural Network The probability of error computed using a trained neural network
as a decision rule, and the true symbol probabilities.

3.3.1  Performance  at  Reducing  Binary  Valued  Irrelevant  Features 

Figure  8  below  shows  error  probabilities  of  the  BDRA  as  a  function  of  the 
number  of  relevant  features  for  each  class.  The  term  “relevant”  means  that  those 
features  are  distributed  uniquely  for  each  class  with  the  remaining  features,  out 
of the total of six, being distributed the same amongst the classes. Note that
the results in Figure 8 are based on randomly generating twenty five samples
of training data for each class, where each sample is a vector containing six
binary valued features. Observe that twenty five samples of training data is
a relatively small number of samples to estimate the probabilities of sixty four
discrete symbols, and this is intended to make correct classification more difficult.

[Figure 8 plot. Curves: a (Unmerged (Training Data)); b (Unmerged (True)); c (Merged (Training Data)); d (Merged (True)); e (Optimal). N class 1 = N class 2 = 25. Horizontal axis: true number of relevant features for each class.]

Figure 8: Performance of the BDRA with six binary valued features, and twenty
five samples of training data for each class.

It can be seen in Figure 8 that based on formula (18), and for all numbers of
relevant features, the BDRA starts out with a probability of error of more than
0.35 (Unmerged (Training Data)), and then reduces this to around 0.1 (Merged
(Training Data)). But, in terms of what this implies for correct classification
performance, notice that based on true statistics the algorithm is able to reduce the
probability of error from about 0.2 (Unmerged (True)) to near optimal (Merged
(True)). In fact, with one relevant feature for each class it obtains the optimal
probability of error. However, there is also somewhat of a relative loss in
performance for the BDRA as the number of relevant features increases, and this issue
is addressed later in Figure 12.


Figure  9:  Performance  comparison  of  the  BDRA  to  a  neural  network  with  binary 
valued  features,  and  twenty  five  samples  of  training  data  for  each  class. 

Performance of the BDRA is compared to a neural network in Figure 9. The
situation in this figure is the same as that of Figure 8, and it can be seen that
the Optimal and Merged (True) results from Figure 8 are repeated here. Also
shown in Figure 9 are classification results for a neural network, and the term
$\sqrt{S}$ represents the sample standard deviation [12] for the probability of error
averaged over all numbers of relevant features. Clearly, in all cases the BDRA
is superior to the neural network, achieving both a lower probability
of error and a smaller sample standard deviation. But, the BDRA appears
to deliver its best performance when the number of relevant features is minimum,
whereas the opposite is true for the neural network.

In generating results for the neural network, it was trained and tested using
the Neural Network Toolbox of Matlab [16]. It is a feed-forward network (whose
neuron model is the log-sigmoid transfer function), which was trained using
backpropagation, momentum, and an adaptive learning rate. The network was
specified to contain three layers: two successive hidden layers consisting of
sixteen and eight nodes, and an output layer of one node. The input consisted
of six nodes corresponding to the six discrete features, and initialization of
network weights was random.

The  following  items  describe  the  relevant  neural  network  parameter  settings 
required  by  the  Matlab  software: 

•  Maximum  number  of  epochs  to  train  (1000). 

•  Sum-squared  error  goal  (0.02). 

•  Learning  rate  (0.01). 

•  Learning  rate  increase  when  adapting  (1.05). 

•  Learning  rate  decrease  when  adapting  (0.7). 


•  Momentum  constant  (0.9). 


•  Maximum  error  ratio  (1.04). 

Additionally, and this is relevant to most of the results in this chapter, in
all figures the optimal probabilities of error were constrained to be less than or
equal to 0.1 (in some cases, and this is pointed out where applicable, the constraint
is also defined to be greater than or equal to 0.5). To achieve this
constraint it was necessary to use Gaussian mixture densities for generating the
underlying true symbol probabilities. In doing so, two equiprobable Gaussian
mixtures were used for relevant features, and three equiprobable mixtures were
used for irrelevant ones (the dimension of a particular Gaussian pdf was equivalent
to the appropriate number of features). Thus, the probability of observing a
binary one for a feature was equivalent to the probability of observing a positive
value for the associated element of the corresponding Gaussian mixture. Using
this model, the probability of error was controlled to meet the specified constraint
by adjusting the spread of the means. The reason Gaussian mixtures
were chosen instead of the Dirichlet distribution, or a similar uniform type
distribution, was that the probability of error using the Dirichlet converges to 0.25
as the symbol quantity, M, increases, while its variance approaches zero (see Appendix D).
Therefore, constraining the probability of error to small values using the Dirichlet
was computationally impractical.
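
The mapping from a Gaussian mixture to binary symbol probabilities described above can be illustrated with a small Monte Carlo sketch. The report does not give the covariance structure, the mean values, or the method used to evaluate the probabilities, so the following assumes independent unit-variance elements within each mixture component and estimates the symbol probabilities by sampling; the function name `symbol_probs_from_mixture` and its parameters are hypothetical.

```python
import random


def symbol_probs_from_mixture(means, n_draws=100_000, sigma=1.0, seed=0):
    """Estimate the probabilities of the 2**Nf binary symbols by sampling.

    means: list of component mean vectors (one value per feature); the
           components are drawn with equal probability.
    sigma: common standard deviation of every element (an assumption).
    """
    rng = random.Random(seed)
    n_feat = len(means[0])
    counts = [0] * (2 ** n_feat)
    for _ in range(n_draws):
        mean = rng.choice(means)              # equiprobable mixture component
        index = 0
        for j, mu in enumerate(mean):
            if rng.gauss(mu, sigma) > 0.0:    # positive value -> binary one
                index |= 1 << j               # feature j sets bit j of the symbol
        counts[index] += 1
    return [c / n_draws for c in counts]
```

In this sketch, moving the component means of the two classes further apart makes the resulting symbol probabilities more distinct, which corresponds to the adjustment of the spread of the means mentioned in the text.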


[Figure 10 plot. Curves: a (Unmerged (Training Data)); b (Unmerged (True)); c (Merged (Training Data)); d (Merged (True)); e (Optimal). N class 1 = N class 2 = 100. Horizontal axis: true number of relevant features for each class.]


Figure  10:  Performance  results  for  the  situation  shown  in  Figure  8  with  one 
hundred  samples  of  training  data  for  each  class. 

In Figures 10 and 11 the effect of increasing the training data size (i.e., by a
factor of four, or $N_{\text{class 1}} = N_{\text{class 2}} = 100$) is demonstrated on the results shown
in Figures 8 and 9. It can be seen, for example, that in Figure 10 the same general
trend appears in these results except that now the error probabilities produced
are better (i.e., smaller). This of course is directly related to the increase in
training data size, which makes estimation of the underlying symbol probabilities
more accurate. Additionally, note that by comparing Figure 8 to Figure 10,
another effect of increasing the training data size is to diminish the variation
of error probabilities with the number of relevant features. This then implies
that more training data is helping to identify features which are useful to correct
classification.

Figure 11: Performance comparison shown in Figure 9 with one hundred samples
of training data for each class.

Continuing with Figure 11, as was done in Figure 9, the performance comparison
to a neural network is now illustrated using the larger training data size of
one hundred samples. Notice, the results labeled as Optimal and Merged (True)
are again plotted from Figure 10, and their sample standard deviations are given
by $\sqrt{S}$. Also, the evident zigzag pattern in the results is attributed to
an artifact of the way in which the true symbol probabilities are generated.


It  can  be  seen  in  Figure  11  that  the  BDRA  outperforms  the  neural  network,  as 
it  did  in  Figure  9,  for  all  possible  numbers  of  relevant  features.  Also,  it  is  apparent, 
and  this  was  previously  observed  in  Figure  10,  that  adding  more  training  data 
has  improved  the  performance  of  both  classifiers,  as  their  error  probabilities  are 
smaller  than  they  were  in  Figure  9. 

With  these  observations  also  notice  in  Figure  11,  and  this  was  not  as  apparent 
in  Figure  9,  that  performance  of  the  neural  network  appears  to  be  less  dependent 
on  the  number  of  relevant  features  than  it  is  for  the  BDRA.  That  is,  the  BDRA 
has  a  tendency  to  be  somewhat  “eager”  to  reduce  the  training  data.  For  an 
illustration  of  this  consider  Figure  12  below. 

Figure 12 below shows the number of relevant binary valued features estimated
by the BDRA, as a function of the actual number, for both training data
sizes used previously. In other words, what appears in this figure is the number
of features, on average, that remained after applying the BDRA to the training
data. In general, completely accurate estimates would produce a straight line
with a slope of unity. However, it can be seen that even with the larger training
data size of one hundred samples the algorithm tends to over-reduce the data.
The one exception to this is the obvious case of one true relevant feature. But,
notice a trend appearing again here in that more training data helps in identifying
those features which are most relevant to correct classification (i.e., the estimates
of relevant feature number increase with training size).



Figure 12: Estimated number of relevant binary valued features by the BDRA.

3.3.2 Performance at Reducing Ternary Valued Irrelevant Features

In Figures 13 and 14 below, the number of discrete levels is increased by one
for each feature so that the training data consists of six ternary valued features.
These figures are used to illustrate performance of the BDRA as the “curse of
dimensionality” [2, 23] becomes more predominant in the data. Notice, these
figures are similar to Figures 8 and 9 as they show error probabilities of the
BDRA as a function of the number of relevant features for each class. But, it
should also be pointed out that with ternary valued features (or those with a
larger number of discrete levels) data reduction means all discrete levels of each
feature are successively reduced one level at a time. That is, a ternary
valued feature is first reduced to being binary valued and then, if reduced again,
it is eliminated. Further, results in these figures are based on twenty five samples of
training data for each class.

Figure 13: Performance of the BDRA with six ternary valued features, and twenty
five samples of training data for each class.
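
The level-by-level reduction of a multi-level feature described above can be sketched as a single merge of two adjacent levels. This is an illustrative sketch only: the report does not state which pair of adjacent levels is merged at each step, so the pair is left as a parameter here, and the function name `merge_adjacent_levels` and the dictionary representation of the counts are assumptions.

```python
def merge_adjacent_levels(counts, feature, level):
    """Merge levels `level` and `level + 1` of one feature by summing counts.

    counts:  dict mapping a symbol (tuple of feature levels) to its count.
    feature: position of the feature being reduced.
    Levels above the merged pair are shifted down by one, so a ternary
    feature (levels 0, 1, 2) becomes binary (levels 0, 1).
    """
    merged = {}
    for symbol, c in counts.items():
        v = symbol[feature]
        new_v = v - 1 if v > level else v
        key = symbol[:feature] + (new_v,) + symbol[feature + 1:]
        merged[key] = merged.get(key, 0) + c
    return merged
```

Applying the function twice to the same ternary feature leaves that feature with a single level, at which point it no longer distinguishes any symbols and is effectively eliminated.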

The  BDRA  starts  out  in  Figure  13  with  a  probability  of  error  of  near  0.5 
(Unmerged  (Training  Data)),  and  then  reduces  this  to  less  than  0.15  (Merged 
(Training  Data))  for  all  numbers  of  true  relevant  features.  Clearly,  as  compared 
to  Figure  8  the  effect  of  the  curse  of  dimensionality  can  be  seen  in  the  high 
initial  error  probabilities.  With  this,  under  true  statistics  and  for  one  relevant 


feature  for  each  class,  the  BDRA  is  able  to  reduce  the  (Unmerged  (True))  error 
probability  from  greater  than  0.35  to  near  optimal  (Merged  (True)).  In  this  way, 
the  BDRA  is  an  effective  means  of  eliminating  the  curse  of  dimensionality,  and 
this  is  addressed  further  in  Chapter  6  where  the  BDRA  is  modified  to  improve  its 
performance  with  high  dimensional  data.  Also,  and  as  expected  from  Figure  12, 
the  relative  loss  in  performance  for  the  BDRA  as  the  number  of  relevant  features 
increases  is  more  severe  with  ternary  valued  features.  Again,  this  is  attributed 
to  the  curse  of  dimensionality. 


Figure  14:  Performance  comparison  of  the  BDRA  to  a  neural  network  with 
ternary  valued  features,  and  twenty  five  samples  of  training  data  for  each  class. 


In Figure 14, performance of the BDRA is compared to a neural network.
This figure shows the Optimal and Merged (True) results from Figure 13, results
for the neural network, and their respective average sample standard deviations,
$\sqrt{S}$. Notice, the BDRA is shown to be overall superior to the neural network
except when all of the features are relevant. Also, it can be seen that an additional
effect of a small amount of training data is the relatively large sample standard
deviations of the BDRA and the neural network. However, as will be seen in
Chapter 6, the negative effects of small sample size can be reduced (i.e., the
BDRA can be improved) if the BDRA is modified to work only with those discrete
symbols represented by the training data.

3.4  Summary 

In this chapter, the Bayesian Data Reduction Algorithm (BDRA) was developed
using the noninformative Dirichlet distribution as a prior on the symbol
probabilities. Additionally, the overall performance of the BDRA was demonstrated,
and it was shown to be superior to a neural network at reducing irrelevant
binary and ternary valued features from the training data of each class.
But, as expected, when both classes contain a complete set of relevant features,
the performance of the BDRA and that of a neural network are similar.


Chapter  4 


The  CBT  and  Mislabeled  Training  Data 


4.1 Introduction

This chapter demonstrates the performance of the CBT, using
the average probability of error, P(e), when the training data of each class are
mislabeled. In this case, classification performance is shown to degrade when
mislabeling exists in the training data, and this occurs with a severity that depends
upon the mislabeling probabilities. Additionally, it is shown that as the mislabeling
probabilities increase, $M^*$, or the best quantization complexity related to the
Hughes phenomenon (see [23, 25, 34]), also increases. Notice that even when
the actual mislabeling probabilities are known by the CBT it is not possible to
achieve the classification performance obtainable without mislabeling. However,
it is also shown that the negative effect of mislabeling can be diminished, with
more success for smaller mislabeling probabilities, if the BDRA of Chapter 3 is
applied to the training data.


Observe, with the situation of interest the training data of each class are assumed
to be made up of two parts: a correctly labeled part, and a mislabeled
part. Specifically, the $N_k$ ($N_l$) training data of class $k$ ($l$) consist of $N_{kk}$ ($N_{ll}$)
correctly labeled observations occurring with probability $1 - \alpha_k$ ($1 - \alpha_l$), and a
remaining $N_{kl}$ ($N_{lk}$) mislabeled observations (i.e., belonging to the other class)
occurring with probability $\alpha_k$ ($\alpha_l$). Also, it is assumed that $N_y$ unlabeled “test”
data are observed. Thus, the problem addressed in this chapter is to illustrate, using
P(e), the effect that mislabeled training data have on classifying the unknown
test data.
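
The two-part structure of the training data can be illustrated by simulating it. The sketch below is an assumption-laden illustration rather than the procedure used in the report: each of the training observations labeled as class k is independently mislabeled with probability alpha_k, in which case its symbol is drawn from the other class's symbol probabilities. The function name `draw_training_counts` and its arguments are hypothetical.

```python
import random


def draw_training_counts(p_k, p_l, n_k, alpha_k, rng):
    """Draw the training data labeled as class k.

    p_k, p_l: symbol probabilities of class k and of the other class l.
    n_k:      total number of training observations labeled k.
    alpha_k:  probability that an observation labeled k belongs to class l.
    Returns (x_kk, x_kl): symbol counts of the correctly labeled and of the
    mislabeled observations.
    """
    m = len(p_k)
    x_kk, x_kl = [0] * m, [0] * m
    symbols = range(m)
    for _ in range(n_k):
        if rng.random() < alpha_k:
            # Mislabeled observation: its symbol follows the other class.
            x_kl[rng.choices(symbols, weights=p_l)[0]] += 1
        else:
            # Correctly labeled observation.
            x_kk[rng.choices(symbols, weights=p_k)[0]] += 1
    return x_kk, x_kl
```

Only the sum of the two returned vectors, the observed counts for class k, would be available to a classifier; the split into correctly labeled and mislabeled parts is hidden.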

4.2  Classification  With  Mislabeled  Training  Data 

4.2.1  Combined  Multinomial  Model 

The  combined  multinomial  for  mislabeled  training  data  is  an  extension  of  that 
shown  in  formula  (1)  of  Chapter  2.  Thus,  the  joint  distribution  for  the  frequency 
of  occurrence  of  all  training  and  test  data  with  the  test  data,  y,  a  member  of 
class  k  is  given  by 


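$$
\begin{aligned}
f(\mathbf{x}_{kk}, \mathbf{x}_{lk}, \mathbf{x}_{ll}, \mathbf{x}_{kl}, \mathbf{y} \mid \mathbf{p}_k, \mathbf{p}_l, H_k; \alpha_k, \alpha_l)
&= N_{kk}!\, N_{lk}!\, N_{ll}!\, N_{kl}!\, N_y! \prod_{i=1}^{M} \frac{p_{k,i}^{\,x_{kk,i} + x_{lk,i} + y_i}\; p_{l,i}^{\,x_{ll,i} + x_{kl,i}}}{x_{kk,i}!\, x_{lk,i}!\, x_{ll,i}!\, x_{kl,i}!\, y_i!} \\
&\quad \times \frac{N_k!}{N_{kk}!\, N_{kl}!}\, \alpha_k^{N_{kl}} (1 - \alpha_k)^{N_{kk}}\; \frac{N_l!}{N_{ll}!\, N_{lk}!}\, \alpha_l^{N_{lk}} (1 - \alpha_l)^{N_{ll}},
\end{aligned}
\tag{20}
$$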


where  (as  in  previous  chapters,  k  and  l  are  exchangeable) 


$k, l \in \{\text{class 1}, \text{class 2}\}$, and $k \neq l$;

$H_k$ is the hypothesis defined as $\mathbf{p}_y = \mathbf{p}_k$;

$M$ is the number of discrete symbols;

$x_{kk,i}$ is the number of occurrences of the $i$th symbol in the correctly labeled
training data for class $k$;

$N_{kk}$ $\{N_{kk} = \sum_{i=1}^{M} x_{kk,i}\}$ is the number of correctly labeled training data for class $k$;

$x_{kl,i}$ is the number of occurrences of the $i$th symbol in the mislabeled training
data for class $k$, appearing with probability $\alpha_k$ and belonging to class $l$;

$N_{kl}$ $\{N_{kl} = \sum_{i=1}^{M} x_{kl,i}\}$ is the number of mislabeled training data for class $k$;

$x_{k,i} = x_{kk,i} + x_{kl,i}$ is the number of occurrences of the $i$th symbol in all training
data for class $k$;

$N_k$ $\{N_k = N_{kk} + N_{kl} = \sum_{i=1}^{M} x_{k,i}\}$ is the total number of training data for class $k$;

$y_i$ is the number of occurrences of the $i$th symbol in the test data;

$N_y$ $\{N_y = \sum_{i=1}^{M} y_i\}$ is the total number of test data;

$p_{k,i}$ $\{\sum_{i=1}^{M} p_{k,i} = 1\}$ is the probability of the $i$th symbol for class $k$.

4.2.2  Combined  Bayes  Test  (CBT) 

The first step in developing the CBT for mislabeled training data is to apply
the Dirichlet of formula (2), Chapter 2, to formula (20) under each class
$k$ and $l$, and then integrate with respect to $\mathbf{p}_k$ and $\mathbf{p}_l$ over the positive
unit-hyperplane resulting in

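$$
\begin{aligned}
f(\mathbf{x}_{kk}, \mathbf{x}_{lk}, \mathbf{x}_{ll}, \mathbf{x}_{kl}, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)
&= \frac{[(M-1)!]^2\, N_{kk}!\, N_{lk}!\, N_{ll}!\, N_{kl}!\, N_y!}{(N_{kk} + N_{lk} + N_y + M - 1)!\, (N_{ll} + N_{kl} + M - 1)!} \\
&\quad \times \prod_{i=1}^{M} \frac{(x_{kk,i} + x_{lk,i} + y_i)!\, (x_{ll,i} + x_{kl,i})!}{x_{kk,i}!\, x_{lk,i}!\, x_{ll,i}!\, x_{kl,i}!\, y_i!} \\
&\quad \times \frac{N_k!}{N_{kk}!\, N_{kl}!}\, \alpha_k^{N_{kl}} (1 - \alpha_k)^{N_{kk}}\; \frac{N_l!}{N_{ll}!\, N_{lk}!}\, \alpha_l^{N_{lk}} (1 - \alpha_l)^{N_{ll}}.
\end{aligned}
\tag{21}
$$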


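Continuing, formula (21) is now expressed in terms of the complete training
data vectors, $\mathbf{x}_k$ and $\mathbf{x}_l$. This is accomplished by substituting the definitions
$\mathbf{x}_{kk} = \mathbf{x}_k - \mathbf{x}_{kl}$ and $\mathbf{x}_{ll} = \mathbf{x}_l - \mathbf{x}_{lk}$ into formula (21), followed by summing over all
possible arrangements of mislabeled training data vectors, yielding

$$
f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)
= \sum_{\mathbf{x}_{kl} = 0}^{\mathbf{x}_k} \sum_{\mathbf{x}_{lk} = 0}^{\mathbf{x}_l}
f(\mathbf{x}_k - \mathbf{x}_{kl}, \mathbf{x}_{lk}, \mathbf{x}_l - \mathbf{x}_{lk}, \mathbf{x}_{kl}, \mathbf{y} \mid H_k; \alpha_k, \alpha_l).
\tag{22}
$$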

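Using this result, the CBT is then given by the ratio of (22) to its analogous
formula under class $l$ (i.e., conditioned on $H_l$), and it appears as

$$
\frac{f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)}{f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_l; \alpha_k, \alpha_l)}
\underset{H_l}{\overset{H_k}{\gtrless}} \tau
\tag{23}
$$

where, for minimizing the probability of error the decision threshold $\tau$ is equal
to $P(H_l)/P(H_k)$.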


4.2.3  Probability  of  Error 


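Letting $z_k = f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)$ (see formula (22) above), the average
probability of error for the CBT is defined as

$$
P(e) = P(H_k)\, P(z_k < \tau z_l \mid H_k) + P(H_l)\, P(z_k > \tau z_l \mid H_l).
\tag{24}
$$

It is necessary to show the first term of (24) only as the second term is similar
except for conditioning on $H_l$. Thus, ignoring $P(H_k)$, the first term of (24) is
given by

$$
P(z_k < \tau z_l \mid H_k) = \sum_{\mathbf{y}} \sum_{\mathbf{x}_k} \sum_{\mathbf{x}_l} I_{\{z_k < \tau z_l\}}\, f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)
\tag{25}
$$

where $f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)$ was defined in formula (22) above.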

Before illustrating results for the CBT notice that, analogous to formula (9) of
Chapter 2, formula (25) can also be rewritten for $N_y = 1$, and this requires that
$f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k; \alpha_k, \alpha_l)$ be given by (see the accompanying text of formula (8) in
Chapter 2 for notational descriptions)


;ock,ai) 

X*„  Xin 


xkfir  xl,ir  Xy 

=  'm2f(xk,ir,xi,ir,yir  =  l\Hk-,ak,ai)=  £  £  £ 

Xkn  Xln  ^klfir=0  Xiktir=z0^2Xkkn=Qj2xun=:0 

x  [(M  -  l)!]2  NkklNiklNii\Nkil  (xkktir  +  xik,jr  +  1)!  (a?»,ty  +  xki,jr)\ 
{Nu  +  Nki  +  M  —  1)!  (Nkk  +  Nik  +  M)\  Xkk,iAxik,iAxii,iAxki,iA 


X 


Ex*jb»  +  £x/*n  +  M-  2 


E  x//n  +  E  xMn  +  M  “  2 


E  ^kkn  “t"  E  X/A;ra 


E  x/fn  “1“  E  x£/n 


X  *r  ,  (at)"“  (1  -  a*)""  (<*!)"“  (1  -  Q|)W“  •  (26) 


3'kk,ir^3'kl,ir^ 


Notice, as with formula (9) of Chapter 2, because of symmetry in the Dirichlet
distribution formula (26) is equal for all $i_r$ in $\{1, 2, \ldots, M\}$. With this, when
$N_y = 1$ the summation over $\mathbf{y}$ in formula (25) involves a sum of the same $M$
terms. Also, the notation $\sum x_{kkn}$ means the sum of all correctly labeled training
data under $k$ that are not represented by the same discrete symbol as the test
observation. Further,
$\binom{\sum x_{kkn} + \sum x_{lkn} + M - 2}{\sum x_{kkn} + \sum x_{lkn}}$
is the number of ways that
$\sum x_{kkn} + \sum x_{lkn}$ training data can be arranged amongst $M - 1$ discrete symbols.
Now, for larger values of $N_y$ formula (26) can be straightforwardly extended.
For example, $N_y = 2$ requires the sum of two terms. That is, formula (26)
must be extended to obtain $f(x_{k,i}, x_{l,i}, y_i = 2 \mid H_k; \alpha_k, \alpha_l)$, and the formula it is to
be summed with, $f(x_{k,i}, x_{k,j}, x_{l,i}, x_{l,j}, y_i = 1, y_j = 1 \mid H_k; \alpha_k, \alpha_l)$. In any case, the
benefit of using formula (26) is that it simplifies the necessary computations of
formula (24), and the results shown below are based on this simplification.
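
The counting argument used above, that n training data can be arranged amongst M − 1 discrete symbols in C(n + M − 2, n) ways, is the standard count of nonnegative integer solutions, and it can be checked by brute force for small cases. The following is a minimal sketch; the function name `count_arrangements` is hypothetical.

```python
from itertools import product
from math import comb


def count_arrangements(n, bins):
    """Count the ways n indistinguishable data can occupy `bins` symbols."""
    return sum(1 for c in product(range(n + 1), repeat=bins) if sum(c) == n)


# With M = 4 symbols, one symbol is taken by the test observation, so the
# remaining n = 3 training data are spread over M - 1 = 3 symbols.
M, n = 4, 3
brute = count_arrangements(n, M - 1)
closed_form = comb(n + M - 2, n)
```

Both `brute` and `closed_form` evaluate to the same count for every small case, which is what lets formula (26) replace an explicit enumeration over the unmatched symbols by a binomial coefficient.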


4.3  Results 

Figure 15 below contains an average probability of error curve, P(e) (plotted
as a function of the number of discrete symbols, M), for the CBT when the true
mislabeling probabilities are given by, respectively, $\alpha_k = \alpha_l = 0.0$, 0.05, 0.15,
0.25, 0.35, and 0.45. Notice that results are based on ten samples of training
data for each class, and one observation of test data. Additionally, the decision
threshold $\tau = 1$. In all cases of Figure 15, observe that P(e) starts out decreasing
with increasing M and is minimum at the point called $M^*$. For example, when
there is no mislabeling in the training data $M^* = 4$, and this case previously
appeared in Figure 4 of Chapter 2. Then, for M greater than $M^*$, P(e) steadily
increases. This dependence of P(e) on M was addressed in Chapter 2 and it
reflects the fact that given a fixed amount of training and test data a prior
quantizing complexity exists which yields, on average, the “best” classification
performance [34]. However, as the mislabeling probabilities are fixed at larger
values, overall performance begins to degrade in that P(e) increases. Also, it can
be seen that accompanying this degradation in performance is an increase in $M^*$.
That is, for the mislabeling probabilities in Figure 15 given by 0.0, 0.05, 0.15,
0.25, 0.35, and 0.45, $M^*$ has the respective values of 4, 5, 6, 8, 10, and 12. Notice,
it can be seen that even when $\alpha_k = \alpha_l = 0.45$, a best quantization complexity
exists. Intuitively, an increase in the mislabeling probabilities causes the classes
to become similar, so that for best classification performance more information
(i.e., a finer quantization) is required.

Figure 15: P(e) with various mislabeling probabilities.

In addition to these findings, it was found that if the mislabeling probabilities assumed
for the training data (i.e., for the $z_k$ and $z_l$ of formula (24)) take on any values
within the range $0 < \alpha_k = \alpha_l < 0.5$, identical results are produced for all cases
in Figure 15. In other words, when testing it does not matter if the CBT of
formula (23) contains the true mislabeling probabilities as long as they are not
assumed to be 0.5 or higher (which would indicate a CBT that is testing as if most
of the training data of each class is more likely to belong to the other class). This
aspect of the CBT’s performance is attributed to the averaging which occurs in
formula (22) over all possible orderings of the mislabeled training data, coupled
with placement of the uniform (i.e., Dirichlet) prior on the symbol probabilities.

In Figure 16 below, results from Figure 15 are repeated (i.e., $N_y = 1$) for the
mislabeling probabilities given by $\alpha_k = \alpha_l = 0.0$, 0.05, 0.15, and 0.25. Additionally,
also shown (lower curves with an *) for the same mislabeling probabilities
is the case involving two observations of test data (i.e., $N_y = 2$). It can be seen
in this figure that when $N_y = 2$ performance improves for a given mislabeling
probability as P(e) is reduced (see Figure 4 of Chapter 2). With this, observe
that as compared to the $N_y = 1$ case, increasing the number of test observations
to $N_y = 2$ causes all associated values of $M^*$, where performance is best,
to increase by one. That is, for the mislabeling probabilities given by 0.0, 0.05,
0.15, and 0.25, and when $N_y = 2$, $M^*$ has the respective values of 5, 6, 7, and 9.
Accompanying this increase in $M^*$ is the associated increase in P(e). However,
it is apparent that the increase in P(e) is relatively worse when $N_y = 2$ (i.e., the
P(e) curves are further apart). The reason this occurs is that although a greater
number of test observations improves the estimation capability of the CBT, there
also is more of a likelihood that a test observation will be of the same value as a
mislabeled training datum.

Figure 16: P(e) with more test observations.

4.4  Applying  the  BDRA  to  Mislabeled  Training  Data 

In  this  section  the  Bayesian  Data  Reduction  Algorithm  (BDRA)  is  applied 
to  mislabeled  training  data.  As  used  here,  the  BDRA  demonstrates  the  degree 
to  which  the  negative  effect  of  mislabeling  can  be  diminished  by  employing  a 
suboptimal algorithm to train on the data. The performance of the BDRA at
classifying, and reducing, feature vectors containing binary and ternary valued
features is described in Chapter 3. In these cases, the BDRA was shown to be superior to
a neural network. Here, the BDRA is applied to feature vectors consisting of six
binary  valued  features  (i.e.,  M  =  64),  which  are  also  mislabeled  in  the  training 
data  of  each  class  according  to  the  probabilities  shown  in  Figure  15. 

Recall, the BDRA works by reducing the quantization complexity, M, of
the training data to a level which minimizes the average conditional probability
of error, P(e | X) (where X represents the entire collection of training data
from all classes). It was shown in Chapter 3 that the formula for P(e | X) is a
fundamental component of the BDRA, and repeated here it is given by (note,
the notational descriptions of formula (20) apply to formula (27))




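$$P(e \mid X) = P(e \mid x_k, x_l) = \sum_{y} \sum_{x_k, x_l} \left[ P(H_k)\, I_{\{z_k \le z_l\}}\, f(y \mid x_k, H_k) + P(H_l)\, I_{\{z_k > z_l\}}\, f(y \mid x_l, H_l) \right] \qquad (27)$$

where in the cases considered involving only one observation of test data (i.e.,
N_y = 1), z_k = f(y | x_k, H_k) = (x_{k,i} + 1) / (N_k + M).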
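
As an illustration only, the single-observation case can be sketched in Python. The sketch assumes two classes, holds the training counts x_k and x_l fixed, and sums over the M possible values of one test symbol; the function names, the default equal priors, and the example counts are hypothetical and are not taken from the original text.

```python
from fractions import Fraction

def z_single(counts, i):
    """Predictive probability of test symbol i when N_y = 1: (x_i + 1) / (N + M)."""
    n = sum(counts)      # number of training samples for this class
    m = len(counts)      # quantization complexity M
    return Fraction(counts[i] + 1, n + m)

def conditional_error(counts_k, counts_l, prior_k=Fraction(1, 2)):
    """Conditional error for fixed training counts, summed over the M test symbols."""
    assert len(counts_k) == len(counts_l)
    prior_l = 1 - prior_k
    total = Fraction(0)
    for i in range(len(counts_k)):
        z_k = z_single(counts_k, i)
        z_l = z_single(counts_l, i)
        if z_k <= z_l:
            total += prior_k * z_k   # class l is decided, so an error occurs under H_k
        else:
            total += prior_l * z_l   # class k is decided, so an error occurs under H_l
    return total

# Hypothetical symbol counts for M = 4 and four training samples per class.
x_k = [3, 1, 0, 0]
x_l = [0, 0, 2, 2]
print(conditional_error(x_k, x_l))
```

Exact fractions are used so that ties between z_k and z_l are decided without floating-point error.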

For  binary  valued  feature  vectors  the  six  iterative  steps  of  the  BDRA  are  also 
repeated  here  from  Chapter  3. 

1. Using the initial training data with quantization complexity M (i.e., M =
2^{N_f}, where N_f is the number of features), use formula (24) to compute
P(e | X; M).

2.  Beginning  with  the  first  feature  (selection  is  arbitrary),  remove  this  feature 
from  each  class  by  summing  (i.e.,  merging)  the  numbers  of  occurrences  of 
those  discrete  symbols  that  correspond  to  its  removal  (i.e.,  for  all  classes 
simultaneously  merge  those  quantized  symbols  containing  a  binary  zero  for 
that  reduced  feature  with  those  containing  a  binary  one). 

3. Use the newly merged training data (X') and the new quantization complexity
(M' = 2^{N_f - 1}), and compute P(e | X'; M').

4. Repeat items two and three for all N_f features.

5.  From  item  four  select  the  minimum  of  all  computed  P(e  |  X';  M')  (ties  are 
broken  arbitrarily),  and  choose  this  as  the  new  training  data  configuration 




for  each  class  (this  corresponds  to  permanently  removing  the  associated 
feature). 

6.  Repeat  items  two  through  five  until  the  probability  of  error  does  not  de¬ 
crease  any  further,  or  M'  =  2,  and  this  defines  the  new  quantization  com¬ 
plexity. 
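
The six steps above can be sketched as a greedy loop. This is a minimal illustration and not the author's implementation: it assumes two classes with equal priors, a single test observation (N_y = 1), training data stored as lists of bit tuples, and ties broken by the lowest feature position. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def conditional_error(data_k, data_l, n_features):
    """Conditional error for N_y = 1 and equal priors (assumptions of this sketch)."""
    m = 2 ** n_features
    c_k, c_l = Counter(data_k), Counter(data_l)
    total = Fraction(0)
    for sym in product((0, 1), repeat=n_features):
        z_k = Fraction(c_k[sym] + 1, len(data_k) + m)
        z_l = Fraction(c_l[sym] + 1, len(data_l) + m)
        total += Fraction(1, 2) * min(z_k, z_l)
    return total

def drop(data, pos):
    """Remove one feature; symbols differing only in that bit merge automatically."""
    return [v[:pos] + v[pos + 1:] for v in data]

def bdra(data_k, data_l):
    """Greedily remove features while the conditional error keeps decreasing."""
    kept = list(range(len(data_k[0])))            # original indices of surviving features
    best = conditional_error(data_k, data_l, len(kept))
    while len(kept) > 1:                          # stop once M' = 2
        trials = []
        for pos in range(len(kept)):
            d_k, d_l = drop(data_k, pos), drop(data_l, pos)
            err = conditional_error(d_k, d_l, len(kept) - 1)
            trials.append((err, pos, d_k, d_l))
        err, pos, d_k, d_l = min(trials, key=lambda t: (t[0], t[1]))
        if err >= best:                           # no further decrease
            break
        best, data_k, data_l = err, d_k, d_l
        del kept[pos]
    return kept, best

# Hypothetical data: only feature 0 separates the two classes.
class_k = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1)]
class_l = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
print(bdra(class_k, class_l))
```

In this toy case the two uninformative features are removed in turn and only feature 0 is kept. Dropping a tuple position is equivalent to summing the counts of the symbols that differ only in that bit, which is the merging described in item two.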

4.5  Results  Using  the  BDRA 

In Figure 17 below, performance of the BDRA is shown when the training data
are mislabeled according to the probabilities specified in Figure 15. The results
in this figure are based on an average of one hundred independent trials. At each
trial, a set of M = 64 true symbol probabilities, consisting of six independent
bit probability pairs, was generated for both classes using Gaussian mixture
distributions (see Chapter 3). Additionally, results appear for two training data
sizes of twenty-five and one hundred samples, which were randomly generated at
each trial from the true symbol probabilities.

Observe  in  Figure  17  that  P(e)  appears  as  a  function  of  the  mislabeling  prob¬ 
abilities  both  with  and  without  applying  the  BDRA  to  the  training  data,  and 
for  the  optimal  test.  Note  that  results  shown  for  the  BDRA  were  obtained  by 
using  its  trained  test  statistic  with  the  actual  symbol  probabilities.  Also,  optimal 
results  are  based  on  the  test  which  knows  all  true  symbol  probabilities,  and  there 
is  no  mislabeling  of  the  training  data.  With  this,  it  can  be  seen  that  the  optimal 




Figure  17:  Performance  of  the  BDRA  with  the  mislabeling  probabilities  of  Figure 
15. 

error  probabilities  are  relatively  constant  at  0.075,  and  this  is  due  to  them  having 
been  constrained  to  be  >  0.05  and  <  0.1.  This  constraint  on  the  optimal  error 
probabilities  was  possible  because  the  true  symbol  probabilities  were  created  with 
Gaussian  mixture  distributions  (see  Chapter  3).  Notice  in  Figure  17  that  in  all 
cases  performance  degrades  with  the  severity  of  the  mislabeling  probabilities, 
which  is  analogous  to  the  results  in  Figure  15.  However,  for  both  training  data 
sizes  the  BDRA  is  successful  at  improving  overall  classification  performance  (rel¬ 
atively  less  improvement,  that  is,  less  data  reduction,  occurs  with  more  training 




data  as  the  probability  estimates  are  more  accurate).  But,  in  all  cases  it  ap¬ 
pears  that  the  improvement  diminishes  rapidly  as  the  mislabeling  probabilities 
approach  0.5.  On  the  other  hand,  with  one  hundred  samples  of  training  data 
and  mislabeling  probabilities  of  less  than  0.1,  performance  is  relatively  close  to 
optimal. 


Figure  18:  Average  number  of  relevant  features  reduced,  out  of  a  total  of  six, 
from  the  training  data  of  each  class. 

Figure  18  shows  the  average  number  of  relevant  features  reduced,  out  of  a 
total  of  six,  from  the  training  data  of  each  class  by  the  BDRA  as  a  function  of 
the  true  mislabeling  probabilities.  In  this  figure,  results  appear  for  those  training 
data  sizes  shown  in  Figure  17,  and  are  based  on  an  average  of  one  hundred 




independent  trials.  As  expected  from  Figure  15,  overall  it  can  be  seen  that 
the  average  number  of  features  reduced  (eliminated)  from  the  training  data  of 
each  class  becomes  less  as  the  mislabeling  probabilities  increase  (the  increase 
in  the  number  of  features  associated  with  a  mislabeling  probability  of  0.05  is 
attributed  to  using  only  one  hundred  independent  trials  to  obtain  the  results). 
Also,  consistent  with  the  results  of  Figure  17,  the  BDRA  appears  to  reduce  a 
larger  number  of  features  when  there  is  less  training  data,  and  this  is  caused  by 
relatively  more  uncertainty  associated  with  the  symbol  probability  estimates. 

4.6  Summary 

The  subject  of  this  chapter  has  been  the  effect  that  mislabeled  training  data 
has  on  classification  performance,  given  there  is  no  knowledge  of  the  underlying 
discrete  symbol  probabilities.  In  general,  it  was  shown  that  as  the  mislabeling 
probabilities  increase,  both  the  average  probability  of  error  and  the  optimum 
quantization  complexity,  M*,  increase.  Additionally,  it  was  found  that  P(e)  can 
be  reduced  if  the  number  of  test  observations  is  increased  to  Ny  =  2.  However, 
the performance degradation caused by mislabeling is relatively larger
than it is when N_y = 1, and this is due to an increased likelihood of the test
data  matching  the  mislabeled  training  data.  Further,  the  BDRA  was  applied  to 
training  data  corrupted  by  mislabeling,  and  results  indicate  that  classification 
performance  can  be  improved  if  the  mislabeling  probabilities  are  not  too  severe. 
But,  the  relative  amount  of  improvement  decreases  with  training  data  size  as  the 
symbol  probability  estimates  become  more  accurate. 




Chapter  5 


The  BDRA  and  Missing  Features 


5.1  Introduction

In  this  chapter,  the  BDRA  is  used  to  classify  test  observations  given  that  the 
training data of each class is missing feature information. Observe, by missing
features it is meant that each of the N_k feature vectors of the training data
for class k is assumed to be made up of either or both of the following two
observation types: features which are represented by discrete values, and missing
features which have no values. For example, with three binary features a possible
feature vector that is missing a single feature might appear as (1,1,x), where
x represents the missing value. In this case, x can have the value of 0 or 1 so
that  this  feature  vector  has  a  cardinality  of  two.  Notice,  the  missing  features  are 
assumed  to  appear  according  to  an  unknown  probability  distribution.  But,  when 
simulating  training  data  with  missing  features  a  uniform  random  variable  is  used 
to  control  their  frequency  of  occurrence. 
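
The notion of cardinality can be illustrated with a short Python sketch. Representing a missing value by None and the function name completions are assumptions made here for illustration only.

```python
from itertools import product

def completions(vector, levels=(0, 1)):
    """All fully observed symbols consistent with a vector whose missing entries are None."""
    choices = [levels if v is None else (v,) for v in vector]
    return list(product(*choices))

w = (1, 1, None)              # the (1,1,x) example from the text
print(completions(w))         # [(1, 1, 0), (1, 1, 1)]
print(len(completions(w)))    # cardinality: 2
```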




Now,  in  the  BDRA  the  missing  feature  information  is  modeled  using  two 
different  approaches.  With  the  first  of  these  approaches,  the  Dirichlet  prior  is 
extended  to  accommodate  missing  features  in  the  natural  way.  That  is,  each 
missing  feature  is  assumed  to  be  uniformly  distributed  over  its  range  of  values. 
In  the  second  approach,  the  number  of  discrete  levels  for  each  feature  is  increased 
by  one  so  that  all  missing  values  for  that  feature  are  assigned  to  the  same  level.  To 
illustrate  performance,  the  BDRA  is  compared  to  a  neural  network  at  classifying 
binary  valued  feature  vectors,  with  the  missing  features  appearing  randomly  in 
the training data of each class. Note, in the case of the neural network each
missing feature is assigned to an additional value as is done for the second approach
of the BDRA. In general, simulation results with six binary valued features
reveal that both approaches to modeling missing features in the BDRA perform
similarly, and they also are both superior to the neural network.

5.2  The  BDRA  Extended  for  Missing  Features 
5.2.1  Method  1 

In the first method, development of the BDRA for the missing features problem
relies on the underlying assumption that, for each feature vector with missing
features, the discrete symbols making up its cardinality of values (i.e., all
possible discrete symbols the feature vector can take on if all possible
arrangements of values are substituted in for the missing features) are uniformly
distributed. Given that, the probability of observing,
for the kth class, a specific arrangement of discrete symbols in the training data



(w_k) and a single test observation of type i (y_i = 1) is defined as (extensions to
N_y > 1 are shown below to be straightforward)

$$f(w_k, y_i = 1 \mid p_k; H_k) = p_{k,i} \prod_{j=1}^{N_k} \sum_{n \in w_{k,j}} p_{k,n} \qquad (28)$$


where (also see the notational definitions of Section 3.2 of Chapter 3) w_{k,j} is a
single observation of a feature vector in the training data of class k. Notice,
without missing features w_{k,j} is a single observation of a symbol of type i, and
with missing features w_{k,j} is one of |w_{k,j}| possible symbols (i.e., |w_{k,j}| is
the cardinality of w_{k,j} or, the number of possible symbols it can take on after
substituting in all arrangements of missing feature values).

Notice, the formula of (28) above represents the sum of $\prod_{j=1}^{N_k} |w_{k,j}|$ terms
(or all possible arrangements of the training data given the missing feature
information) each having the form

$$p_{k,1}^{x_{k,1}} \cdots p_{k,i}^{b+1} \cdots p_{k,M}^{x_{k,M}} \qquad (29)$$

where, for example, b + 1 is the number of p_{k,i}'s in the product.

Now, the multinomial coefficients are multiplied by the product in (29) and the
uniform Dirichlet distribution of formula (2) in Chapter 2.
Then, integration is carried out over the unit simplex producing the result

$$(b + 1)\, \frac{N_k!\,(M - 1)!}{(N_k + M)!}. \qquad (30)$$




Thus, applying the above result to all terms in formula (28) yields

$$f(w_k, y_i = 1 \mid H_k) = \sum_{b} C_b\, (b + 1)\, \frac{N_k!\,(M - 1)!}{(N_k + M)!} \qquad (31)$$

in which C_b is the number of terms in the product of formula (28) containing
(b + 1) p_{k,i}'s.

From here, a fictitious Bernoulli random variable Y_j is defined such that
$\sum_{j=1}^{N_k} Y_j = b$. Thus, if S_i is defined as the event of being all those w_{k,j} that
can take on symbol i, and with each discrete symbol that each w_{k,j} can take on
being equally likely, the following probabilities are straightforward to see,

$$\Pr(Y_j = 1) = \begin{cases} 0, & j \notin S_i \\ \dfrac{1}{|w_{k,j}|}, & j \in S_i. \end{cases} \qquad (32)$$

Now, using (32) and the definition of C_b produces the formula

$$C_b = \left( \prod_{j=1}^{N_k} |w_{k,j}| \right) \Pr\left( \sum_{j=1}^{N_k} Y_j = b \right). \qquad (33)$$


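Substituting formula (33) in formula (31), and summing over all possible
values of b results in

$$f(w_k, y_i = 1 \mid H_k) = \left( \prod_{j=1}^{N_k} |w_{k,j}| \right) \left( 1 + \sum_{j \in S_i} \frac{1}{|w_{k,j}|} \right) \frac{N_k!\,(M - 1)!}{(N_k + M)!}. \qquad (34)$$

Given the result above, without a test observation formula (34) becomes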




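$$f(w_k \mid H_k) = \left( \prod_{j=1}^{N_k} |w_{k,j}| \right) \frac{N_k!\,(M - 1)!}{(N_k + M - 1)!}. \qquad (35)$$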


Therefore, for missing features the desired conditional distribution for the
BDRA, or the z_k of formula (18) in Chapter 3, is produced by dividing formula
(34) by formula (35), resulting in

$$z_k = f(y_i = 1 \mid w_k, H_k) = \frac{1 + \sum_{j \in S_i} \frac{1}{|w_{k,j}|}}{N_k + M}. \qquad (36)$$
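
A minimal Python sketch of this single-observation result follows. It assumes the reading in which each training vector contributes a weight of 1/|w_{k,j}| to every symbol it could represent; missing values are marked with None, and the function names and example data are hypothetical.

```python
from fractions import Fraction
from itertools import product

def completions(vector, levels=(0, 1)):
    """All fully observed symbols consistent with a vector whose missing entries are None."""
    choices = [levels if v is None else (v,) for v in vector]
    return list(product(*choices))

def z_missing(train, symbol, levels=(0, 1)):
    """(1 + sum over compatible training vectors of 1/cardinality) / (N_k + M)."""
    m = len(levels) ** len(symbol)       # quantization complexity M
    soft = Fraction(0)
    for w in train:
        comp = completions(w, levels)
        if tuple(symbol) in comp:        # this vector can take on the test symbol
            soft += Fraction(1, len(comp))
    return (1 + soft) / (len(train) + m)

# Hypothetical training data for one class, three binary features.
train_k = [(1, 1, None), (1, 1, 1), (0, None, None)]
print(z_missing(train_k, (1, 1, 1)))
```

With no missing entries every cardinality is one and the expression reduces to (x_{k,i} + 1) / (N_k + M). Under this reading the values also sum to one over all M symbols, which serves as a quick sanity check.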

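Now, based on formula (17) of Chapter 3, it is straightforward to see that
formula (36) for N_y > 1 appears as

$$f(y \mid w_k, H_k) = \frac{(N_k + M - 1)!\,(N_y)!}{(N_k + N_y + M - 1)!} \prod_{i=1}^{M} \frac{\left( \sum_{j \in S_i} \frac{1}{|w_{k,j}|} + y_i \right)!}{\left( \sum_{j \in S_i} \frac{1}{|w_{k,j}|} \right)!\; y_i!}. \qquad (37)$$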
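
Because the fractional sums need not be integers, a numerical sketch has to decide how to evaluate their factorials. The Python sketch below assumes they are interpreted through the gamma function, x! = Γ(x + 1), which is an assumption of this illustration and is not stated in the text. The function names and example values are hypothetical.

```python
from math import exp, lgamma

def log_fact(x):
    """log(x!) using the gamma function, so that non-integer x is allowed."""
    return lgamma(x + 1.0)

def predictive(soft_counts, test_counts):
    """Predictive probability of test counts y given fractional training counts."""
    m = len(soft_counts)
    n_k = sum(soft_counts)       # fractional counts add up to the number of training vectors
    n_y = sum(test_counts)
    log_p = log_fact(n_y) + log_fact(n_k + m - 1) - log_fact(n_k + n_y + m - 1)
    for s, y in zip(soft_counts, test_counts):
        log_p += log_fact(s + y) - log_fact(s) - log_fact(y)
    return exp(log_p)

# Hypothetical fractional counts for M = 4 symbols from N_k = 3 training vectors.
soft = [1.5, 0.5, 1.0, 0.0]
print(predictive(soft, [1, 0, 0, 0]))   # close to (1.5 + 1) / (3 + 4)
```

Working with logarithms avoids overflow for large counts, and for a single test observation the value agrees with the N_y = 1 expression.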


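The final step with this method is to develop a formula for the situation of
missing features in both the training and test data, and this is given by

$$f(y \mid w_k, H_k) = \frac{(N_k + M - 1)!\,(N_y)!}{(N_k + N_y + M - 1)!} \prod_{i=1}^{M} \frac{\left( \sum_{j \in S_i} \frac{1}{|w_{k,j}|} + \sum_{j \in S_{y,i}} \frac{1}{|w_{y,j}|} \right)!}{\left( \sum_{j \in S_i} \frac{1}{|w_{k,j}|} \right)! \left( \sum_{j \in S_{y,i}} \frac{1}{|w_{y,j}|} \right)!} \qquad (38)$$

where (see formula (28))

w_{y,j} is a single observation of a feature vector in the test data;

|w_{y,j}| is the cardinality of w_{y,j};

S_{y,i} is defined as the event of being all those w_{y,j} that can take on symbol i.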




5.2.2  Method  2 


The  BDRA  is  also  extended  for  the  missing  features  problem  using  a  second 
method,  which  requires  no  additional  probabilistic  development.  The  basic  idea 
of  this  method  is  to  increase  the  number  of  discrete  levels  by  one  for  each  feature 
that  has  missing  values  (this  actually  represents  one  of  several  possible  filling  in 
type  methods  used  with  neural  networks,  [7]).  Thus,  the  additional  level  created 
for  each  feature  is  used  to  represent  each  missing  value.  For  example,  for  six 
binary  valued  features,  and  if  it  is  known  that  any  of  these  features  can  be  missing 
from  either  the  training  or  test  data,  then  the  initial  quantization  complexity  is 
increased  from  M  =  64  to  M  =  729.  That  is,  each  feature  is  ternary  valued 
instead  of  binary  valued.  In  the  next  section,  results  are  shown  for  both  of  these 
methods. 
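
As a small illustration of Method 2, the sketch below maps each missing value to a third level and indexes the resulting ternary symbol. The marker None, the level code 2, and the function names are assumptions chosen for this example.

```python
MISSING_LEVEL = 2   # hypothetical code for the extra level that represents "missing"

def encode(vector):
    """Replace every missing entry (None) with the additional discrete level."""
    return tuple(MISSING_LEVEL if v is None else v for v in vector)

def symbol_index(vector, levels=3):
    """Index of the encoded vector among levels ** len(vector) discrete symbols."""
    idx = 0
    for v in encode(vector):
        idx = idx * levels + v
    return idx

n_features = 6
print(2 ** n_features, 3 ** n_features)           # 64 729
print(encode((1, None, 0, 0, 1, None)))           # (1, 2, 0, 0, 1, 2)
print(symbol_index((None,) * n_features))         # 728
```

After this encoding the ordinary BDRA formulas apply unchanged, with the larger initial value of M.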

5.3  Results 

Figure  19  below  shows  error  probabilities  for  the  BDRA  (using  Method  1),  a 
neural  network,  and  the  optimal  test  as  a  function  of  the  true  number  of  relevant 
features  for  each  class  (each  class  contains  a  total  of  six  binary  valued  features). 
Note that in this case optimal error probabilities have been constrained to be
> 0.05 and < 0.1 (for more on this and the notation used here see Figures 8
through  14  of  Chapter  3,  and  the  accompanying  text).  Additionally,  there  are 
twenty  five  samples  of  training  data  for  each  class,  and  results  are  based  on 
an  average  of  one  hundred  independent  trials.  For  the  case  shown  in  Figure 




Figure  19:  Performance  comparison  of  the  BDRA  to  a  neural  network  when  a 
random  number  of  missing  features  occurs  with  a  probability  of  0.15. 


19,  a  0.15  probability  exists  for  each  class  that  up  to  three  randomly  selected 
features  will  be  missing  from  the  feature  vectors  in  the  training  data  (no  missing 
features  appear  in  the  test  data).  Also,  the  neural  network  is  trained  on  ternary 
valued  feature  vectors  where  a  third  discrete  level  is  used  for  each  missing  feature. 
Alternatively,  the  neural  network  was  also  trained  by  substituting  the  average 
of  the  known  feature  values  in  for  each  missing  feature,  and  this  produced  no 
substantial  change  in  performance.  Observe  in  Figure  19  that  the  BDRA  is 
superior  to  the  neural  network  by  achieving  an  overall  lower  probability  of  error. 
However, the sample standard deviation, √S, of both schemes is similar, and this




is  partly  due  to  a  tighter  constraint  on  the  optimal  error.  Additionally,  and  as 
previously  mentioned  in  Chapter  3,  the  BDRA  performs  best  when  the  number  of 
relevant  features  is  minimum,  whereas  the  opposite  is  true  for  the  neural  network. 
Further, and not shown here, the performances of the two classifiers approach each
other  as  the  training  data  size  increases,  and  this  is  expected  based  on  the  results 
of  Chapter  3  (see  Figure  11). 


Figure  20:  Performance  comparison  of  Figure  19  repeated  using  the  BDRA  and 
Method  2. 


In  Figure  20  the  same  situation  appears  as  that  shown  in  Figure  19  except 
that  now  Method  2  is  used  for  the  BDRA  instead  of  Method  1.  It  can  be  seen  by 
comparing  Figures  19  and  20  that  both  methods  for  the  BDRA  perform  similarly 




at  reducing  the  data  when  it  contains  missing  features.  Notice,  this  is  important 
in  Chapter  6  (see  Section  6.4)  where  the  BDRA  is  applied  to  a  high  dimensional 
data  set  because,  as  it  turns  out,  in  such  a  case  Method  2  is  less  computationally 
intensive  than  Method  1. 

With  these  results,  the  probability  of  missing  features  appearing  in  the  train¬ 
ing  data  was  increased  to  0.25,  and  in  this  situation  the  neural  network  showed 
more  of  a  relative  performance  loss  than  the  BDRA  (particularly  so  with  less  than 
three  relevant  features  per  class).  However,  overall  performance  for  both  tests 
was  similar  to  that  shown  in  Figures  19  and  20.  Additionally,  it  was  found  that 
both tests were equally effective at classifying test observations which contained
missing  features  (see  formula  (38)). 

5.4  Summary 

The  performance  of  the  BDRA  has  been  discussed  at  classifying  six  binary 
valued  features  given  the  training  data  of  each  class  contains  missing  feature 
information.  In  adapting  the  BDRA  for  missing  features  two  methods  were  de¬ 
veloped.  In  the  first  method,  each  missing  feature  was  assumed  to  be  uniformly 
distributed  over  the  cardinality  of  possible  values  it  can  take  on.  With  the  sec¬ 
ond  method,  each  missing  feature  was  assigned  to  an  additional  discrete  level, 
which  was  obtained  by  increasing  the  number  of  discrete  levels  for  that  feature  by 
one.  Overall,  it  was  found  that  both  methods  produced  similar  results.  However, 




similar  to  that  found  in  Chapter  3,  the  BDRA  (using  either  method)  was  demon¬ 
strated  to  be  superior  to  a  neural  network  at  reducing  irrelevant  features  from 
the  training  data.  In  general,  the  missing  feature  information  did  not  appear  to 
have  a  large  impact  on  degrading  classification  performance  (for  the  BDRA  or 
the  neural  network),  even  when  the  probability  was  moderately  high  for  missing 
features  to  occur. 




Chapter  6 


Application  of  the  BDRA  to  Miscellaneous 
Problems  in  Classification 


6.1  Introduction 

In  this  last  chapter,  the  BDRA  is  applied  to  three  interesting  problems  in 
classification.  In  the  first  problem,  the  BDRA  is  applied  to  reducing  the  dimen¬ 
sionality  of  a  data  set  that  contains  class-specific  features,  and  its  performance 
is  compared  to  the  method  developed  in  [2].  With  this  comparison,  the  BDRA 
is  shown  to  be  an  effective  means  of  selecting  binary  valued  features,  which  have 
been  made  class-specific  in  an  ad  hoc  manner.  In  fact,  performance  results  reveal 
that  when  using  a  small  number  of  training  data  relative  to  feature  dimensional¬ 
ity  (and  when  class-specific  features  exist  for  each  class),  the  BDRA  outperforms 
the  class-specific  classifier  of  [2]. 




In  the  second  problem,  the  BDRA  is  applied  to  the  fusion  of  features  ex¬ 
tracted  from  sonar  echoes  generated  by  independent  continuous  wave  (CW)  and 
frequency  modulated  (FM)  waveforms.  With  this  problem,  a  feature  vector  con¬ 
sisting  of  five  features  taken  from  CW  and  FM  track  pairs  is  used  for  detecting 
targets  in  various  littoral  environments.  To  illustrate  performance,  the  BDRA  is 
trained  and  tested  on  nearly  five  thousand  samples  of  real  sonar  data  consisting 
of  these  five  dimensional  feature  vectors.  Overall,  it  is  shown  that  the  BDRA 
improves  target  recognition  performance,  over  that  of  other  methods,  by  using 
three  of  the  five  features  and  quantizing  them  to  binary  values. 

With  the  final  problem,  the  BDRA  is  trained  and  tested  on  what  is  known  in 
the  literature  as  the  Australian  Credit  Card  Data  (ACCD),  [66].  Note,  the  ACCD 
is  based  on  the  actual  credit  history  of  690  applicants,  and  it  consists  of  fifteen 
features  (both  continuous  and  discrete  valued),  including  missing  features.  In 
terms  of  performance,  the  BDRA  is  shown  to  be  far  superior  to  a  neural  network 
at  classifying  the  test  data. 

6.2  The  BDRA  Applied  to  the  Selection  of  Class-Specific  Features 

In [2], a novel approach to reducing the dimensionality of a feature set was
developed by reformulating the optimum Bayesian classifier for C classes,
given by

$$\arg\max_{1 \le k \le C} \left[ f(z_1, \ldots, z_C \mid H_k)\, P(H_k) \right] \qquad (39)$$




into an equivalent class-specific classifier, having the form

$$\arg\max_{1 \le k \le C} \frac{f(z_k \mid H_k)}{f(z_k \mid H_k, \theta_k = \theta_k^0)}\, P(H_k) \qquad (40)$$

where (see Chapters 2 and 3 for more on notation)

θ_k completely parameterizes the data, X_k, representing the kth class;

z_k = T_k(x) is a sufficient statistic for θ_k;

f(z_k | H_k, θ_k = θ_k^0) is a normalizing distribution and is the same for all k.

Notice that formula (39) can be expressed as formula (40) because each class
has a unique sufficient statistic, z_k, which captures all relevant information about
the parameter θ_k. In this way, formula (40) has important implications for data
reduction. That is, the effects of the curse of dimensionality can be reduced
if f(z_k | H_k) of formula (40) is estimated, as opposed to the higher dimensional
f(z_1, ..., z_C | H_k) of formula (39). Given that, the problem addressed here is
to  determine  if  the  BDRA  can  effectively  reduce  irrelevant  information  from  a 
training  data  set  which  contains  class-specific  features.  As  a  measure  of  perfor¬ 
mance,  the  probability  of  error  for  the  BDRA  is  compared  to  the  probability  of 
error  for  formula  (40). 

To  illustrate  the  class-specific  classifier,  consider  Figure  21  below  where  the 
probability  of  error  is  plotted  versus  the  true  number  of  class-specific  features  (out 
of  a  total  of  six  binary  valued  features)  for  each  of  two  classes.  In  this  figure, 
probability  of  error  curves  are  shown  for  formula  (16)  of  Chapter  3  (CBT),  the 
class-specific  version  of  formula  (16)  (CBT  (class-specific)),  and  the  optimal  test. 




In this case, optimal error probabilities have been constrained to be > 0.05 and
< 0.1 (for more on this and the notation used here, see the text accompanying
Figures 8 through 14 in Chapter 3).


Figure  21:  Performance  of  a  class-specific  classifier  with  binary  valued  features, 
and  five  samples  of  training  data  for  each  class. 

Observe  in  Figure  21  that  class  0  represents  the  common  null  class,  and  error 
probabilities  in  this  figure  (also  in  Figures  22  and  23)  are  based  only  on  classifying 
classes  1  and  2.  With  this,  in  generating  class-specific  features  for  each  class  an 
ad  hoc  approach  has  been  adopted  using  simulated  data.  That  is,  for  a  given 
true  number  of  class-specific  features  those  features  which  are  class-specific,  out 
of  the  total  six,  are  determined  randomly  for  each  class.  The  remaining  features 




(i.e.,  those  which  are  not  class-specific)  are  then  distributed  according  to  the 
null hypothesis, H_0. Additionally, bit probabilities for the class-specific features
are  determined  using  a  Gaussian  mixture  distribution  with  a  random  number 
of  modes  (up  to  six  modes),  and  bit  probabilities  for  the  commonly  distributed 
features  are  based  on  a  single  Gaussian  distribution.  Also,  results  in  Figure  21 
are  based  on  an  average  of  two  hundred  fifty  independent  trials  where,  for  each 
trial,  ten  thousand  independent  samples  of  test  data  were  generated.  Further, 
there  are  five  samples  of  training  data  for  each  class  (including  the  null  class), 
and the sample standard deviations, √S, are given for each test.

It can be seen in Figure 21 that the class-specific classifier based on formula
(40)  is  superior  to  the  classifier  based  on  formula  (39)  by  achieving  an  overall  lower 
probability  of  error.  However,  the  class-specific  classifier  also  shows  a  slightly 
higher  sample  standard  deviation  in  the  probability  of  error  than  does  the  non- 
class-specific  classifier.  Notice,  it  can  be  seen  that  as  more  class-specific  features 
are  added  to  the  feature  vectors  of  each  class  the  performance  of  both  classifiers 
becomes  similar.  With  this,  and  as  expected,  when  all  features  are  class-specific 
there  is  no  difference  in  performance  between  the  two  methods. 

An  apparent  observation  in  Figure  21  is  that  an  insufficient  number  of  training 
data  causes  the  performance  of  both  classifiers  to  be  significantly  above  optimal. 
Given  that,  the  effect  of  additional  training  samples  is  illustrated  in  Figure  22.  In 
this  case,  the  situation  of  Figure  21  is  repeated  except  that  each  class  now  contains 
fifty  samples  of  training  data.  As  expected,  observe  in  this  figure  that  not  only 






Figure  22:  Performance  of  Figure  21  repeated  with  fifty  samples  of  training  data 
for  each  class. 

has  additional  training  data  lowered  the  error  probabilities  of  both  classifiers  (and 
with  this  the  sample  standard  deviations  are  significantly  less),  but  performance 
of  the  two  methods  is  also  closer. 

In  Figure  23  below,  it  is  demonstrated  that  performance  can  be  improved  for 
the  situation  of  Figure  21  if  the  BDRA  is  applied  to  the  training  data.  Notice,  in 
this  figure  that  the  error  probability  curves  of  Figure  21  are  replotted,  and  results 
are  shown  for  the  BDRA  applied  using  three  different  methods  (results  for  each  of 
these  methods  were  produced  independently  using  the  same  randomly  generated 
symbol  probabilities,  but  with  different  randomly  generated  data).  In  the  first 






Figure  23:  Performance  of  the  BDRA  with  class-specific  features  when  applied 
to  the  situation  of  Figure  21. 

method,  a  class-specific  version  of  the  BDRA  is  used  and  labeled  as  BDRA(class- 
specific).  Note,  this  method  is  essentially  based  on  formula  (40)  in  that  feature 
reduction  is  performed  separately  on  each  of  the  two  relevant  classes  versus  the 
null,  class  0.  It  can  be  seen  that  except  for  the  case  of  one  class-specific  feature 
per  class,  this  version  of  the  BDRA  improves  performance.  However,  in  the  next 
method  the  BDRA  is  applied  to  all  three  classes  simultaneously  and  it  appears  as 
BDRA(3).  Clearly,  results  shown  for  BDRA(3)  represent  an  overall  improvement, 
but,  the  last  method  shown  as  BDRA(2)  offers  the  best  performance.  In  this 
method,  the  BDRA  is  applied  simultaneously  to  only  the  two  relevant  classes 




(i.e.,  class  1  and  class  2).  Thus,  feature  reduction,  and  selection,  using  BDRA(2) 
is  based  only  on  those  features  which  best  discriminate  class  1  from  class  2,  and 
without  directly  observing  the  null. 

The  results  shown  in  Figure  23  reveal  that  the  BDRA,  and  in  particular  one 
which  is  applied  to  only  the  relevant  classes,  is  able  to  improve  performance 
by  effectively  reducing  the  dimensionality  of  a  training  data  set  based  on  the 
empirical  statistics  of  that  data.  Therefore,  when  automatically  selecting  relevant 
features,  using  a  limited  amount  of  training  data,  it  is  important  to  measure  the 
impact  that  removal  of  irrelevant  features  has  on  overall  discrimination  capability 
amongst  the  relevant  classes. 

6.3  The  BDRA  Applied  to  the  Fusion  of  Features  From  Independent 
Sonar  Echoes 

In  this  section,  the  BDRA  is  used  for  the  fusion  of  features  extracted  from 
sonar  echoes  generated  by  independent  continuous  wave  (CW)  and  frequency 
modulated  (FM)  waveforms.  These  sonar  echoes  were  obtained  in  several  different 
littoral  environments,  and  their  purpose  is  to  track  and  detect  various  surface 
ships  and  submarines.  The  complete  data  set  used  here  represents  more  than 
two  thousand  pings,  which  is  a  total  time  duration  of  approximately  fifteen  hours. 
Notice,  correct  target  recognition  with  this  data  presents  an  interesting  challenge 
because  the  sample  size  of  the  nontarget  training  data  is  more  than  five  times 




larger  than  that  of  the  target,  and  the  CW  waveform  is  a  much  better  detector 
than  the  FM  waveform. 

From  this  data,  five  features  were  extracted  and  formed  into  the  following 
feature  vector, 

{  Chi-square  Statistic,  CW  Doppler,  FM  Doppler,  CW  KLLR,  FM  KLLR  } 
where 

Chi-square  statistic  is  a  measure  of  track  similarity,  and  it  is  obtained  from 
the  normalized  (by  the  estimation  errors)  product  of  the  difference  between 
the  individual  CW  and  FM  track  state  estimates. 

CW  Doppler  appears  in  knots  and  is  measured  from  the  CW  processor. 

FM  Doppler  appears  in  knots  and  is  estimated  from  range  rate. 

CW  KLLR  is  the  Kinematic  log  likelihood  ratio  detection  statistic  for  CW  that 
is  based  on  track  innovation. 

FM  KLLR  is  the  Kinematic  log  likelihood  ratio  detection  statistic  for  FM  that 
is  based  on  track  innovation. 

Note, the track state is a four dimensional vector of position and velocity
estimates in both coordinates, and track estimation error is a 4 x 4 matrix.
Also, for each waveform, target track estimation is performed by an Interacting
Multiple Model (IMM) Kalman Filter (see [4]).




Based  on  this  feature  vector,  the  data  was  partitioned  into  two  classes;  that 
is,  a  target  class  and  a  nontarget  class  (this  latter  class  is  made  up  of  background 
disturbances  such  as  shipping  noise  and  clutter).  The  target  class  consists  of  CW 
and  FM  features  in  which  at  least  one  waveform  has  been  verified  to  originate 
from  a  valid  target,  while  the  nontarget  class  is  made  up  of  only  nontarget  features 
from each waveform. Note that, in order to correctly label the data, identification
of true targets was performed by comparing estimated tracks to those of the
Global Positioning System (GPS). Therefore, any track not identified as a true
target was, by default, labeled a nontarget. With this, the total
training set size was 5774 samples, of which $N_{target} = 848$ and $N_{nontarget} = 4926$.
Actually,  the  original  data  contains  nearly  one  hundred  thousand  track  pairs  that 
can  be  considered  of  the  nontarget  category.  However,  a  form  of  track  pruning, 
or  gating,  was  employed  to  substantially  reduce  this  number  by  ordering  all  Chi- 
square  statistics.  That  is,  for  each  track  only  the  smallest  Chi-square  statistic 
was  accepted  (the  track  it  most  closely  associated  with),  and  all  other  larger  Chi- 
square  statistics  involving  this  track  were  rejected  (all  other  tracks  it  might  also 
have  been  associated  with). 
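
The pruning rule above can be sketched in a few lines. This is only one reading of the rule, assuming each candidate association is stored as a hypothetical (cw_track_id, fm_track_id, chi_square) tuple and that an accepted pair removes both of its tracks from further consideration; the function name and data layout are illustrative and are not taken from the original processing chain.

```python
def gate_track_pairs(pairs):
    """Keep, for each track, only its smallest chi-square association.

    pairs: iterable of (cw_track_id, fm_track_id, chi_square) tuples
    (a hypothetical layout chosen for this sketch).
    """
    accepted = []
    used_cw, used_fm = set(), set()
    # Order all chi-square statistics from smallest to largest.
    for cw_id, fm_id, chi2 in sorted(pairs, key=lambda p: p[2]):
        # Reject any pair that involves an already associated track.
        if cw_id in used_cw or fm_id in used_fm:
            continue
        accepted.append((cw_id, fm_id, chi2))
        used_cw.add(cw_id)
        used_fm.add(fm_id)
    return accepted


candidates = [("cw1", "fm1", 3.2), ("cw1", "fm2", 9.5),
              ("cw2", "fm1", 4.1), ("cw2", "fm2", 60.0)]
pruned = gate_track_pairs(candidates)
```

In this toy input four candidate pairs are reduced to two, which mirrors how the gating shrinks the nontarget set.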

Before applying the BDRA to this data it was necessary to threshold each
feature into an initial set of discrete levels. This thresholding was based on
experience examining the data, and as a result, four thresholds were chosen for
each feature. Thus, with four discrete levels per feature the initial quantization
complexity of this data was M = 1024. Table 1 below lists these thresholds where
at each discrete level the upper bound is shown, and the lower bound is defined
in the next lower level.

Table 1: Threshold Settings for Each Feature Before Applying the BDRA

Discrete level   Chi-square statistic   CW & FM Doppler   CW & FM KLLR
1                7.78                   1                 2.3
2                50                     5                 6.8
3                100                    10                20
4                100K                   70                30
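
The thresholding step can be illustrated with a short sketch. It uses the Doppler and KLLR upper bounds listed in Table 1; the function names, the inclusive treatment of each upper bound, and the clipping of values above the last bound are assumptions made for this example only.

```python
from bisect import bisect_left

# Upper bounds per discrete level, from the Doppler and KLLR columns of Table 1.
DOPPLER_BOUNDS = [1, 5, 10, 70]
KLLR_BOUNDS = [2.3, 6.8, 20, 30]


def discrete_level(value, upper_bounds):
    """Return the 1-based level whose upper bound first covers value."""
    level = bisect_left(upper_bounds, value) + 1
    # Assumption: values above the last bound are clipped to the top level.
    return min(level, len(upper_bounds))


def symbol_index(levels, levels_per_feature=4):
    """Map a tuple of 1-based levels to a single symbol index in [0, M)."""
    index = 0
    for level in levels:
        index = index * levels_per_feature + (level - 1)
    return index


# Five features with four levels each give M = 4**5 = 1024 discrete symbols.
M = 4 ** 5
last_symbol = symbol_index((4, 4, 4, 4, 4))
```

Each quantized feature vector then corresponds to exactly one of the M symbols that the BDRA starts from.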

After the BDRA was applied to the data, the initial quantization complexity
of M = 1024 was reduced to a final quantization complexity of M = 8. With this,
the  computed  empirical  probability  of  error  (see  formula  (18),  and  Figure  8  of 
Chapter  3)  was  reduced  from  0.325  to  0.117.  In  reducing  this  data,  it  was  found 
that  the  BDRA  completely  removed  the  FM  features.  Additionally,  it  reduced 
the  Chi-square  statistic,  CW  Doppler,  and  CW  KLLR  to  binary  valued  features 
keeping,  respectively,  the  thresholds  of  7.78,  1,  and  2.3.  Thus,  for  correct  target 
recognition  the  BDRA  prefers  to  rely  mostly  on  CW,  and  it  only  uses  FM  when 
it  associates  with  CW  through  the  Chi-square  statistic.  As  it  turns  out,  this  is 
consistent  with  the  fact  that  FM  is  known  to  perform  poorly  in  this  data,  and 
this  is  considered  further  in  Figure  24  below. 

Performance results of applying the BDRA for fusing CW and FM features are
illustrated in Figure 24. In this figure, the total number of true detected targets
versus the number of false detections per hour appears for the BDRA (note,
results for the BDRA have been determined by testing on the training data),
the Chi-square statistic, and an OR detector (this detector is based on a logical
OR of the individual CW and FM KLLR detector decisions). All results shown
in Figure 24 represent detected target tracks, which have been converted from
detected target pings using knowledge of the average number of pings contained
in a track. It can be seen in Figure 24 that for low rates of false alert (the area of
most interest) the BDRA is able to improve performance over the other methods.
Notice, the OR detector performs poorly due to the high false alert rate of FM.
Also, the Chi-square test is opportunity limited in that the target must exist in
both waveforms in order for this test to detect it. On the other hand, the BDRA
overcomes these limitations by selectively choosing those features associated with
best performance.

Figure 24: Target recognition performance comparison of the BDRA to the Chi-
square statistic, and an OR detector.

6.4  The  BDRA  Applied  to  the  Australian  Credit  Card  Data 

In  the  last  problem  addressed  in  this  chapter  the  BDRA  is  trained  and  tested 
on  the  Australian  Credit  Card  Data  (ACCD).  The  ACCD  is  a  data  set  that  is 
often  used  by  other  authors  for  trying  out  their  algorithms  (for  example,  see 
[63,  66]),  and  it  is  based  on  the  actual  credit  history  of  690  applicants  (307 
applicants  were  issued  credit,  and  383  were  denied  credit).  Also,  the  ACCD 
contains  fifteen  total  features  (six  continuous  and  nine  discrete),  as  well  as  some 
missing  values  (about  five  percent  of  the  feature  vectors  have  one  or  more  missing 
values).  Notice,  that  feature  definitions  are  not  supplied  with  this  data  in  order 
to  keep  them  confidential. 

The  ACCD  is  an  interesting  data  set  because  it  contains  characteristics  which 
make  classification  difficult.  That  is,  because  the  ACCD  contains  a  mix  of  fifteen 
discrete  and  continuous  valued  features,  including  missing  features,  a  total  of 
690  samples  (which  must  be  partitioned  into  training  and  test  data  sets  for  both 
classes)  is  a  relatively  small  data  set  for  classifying  accurately. 

Because the ACCD contains missing features, the methods of Chapter 5 are
employed when applying the BDRA to the data. Specifically, the version of the
BDRA called Method 2 in Chapter 5 is used, as it was found to be significantly less
computationally intensive with high dimensional data such as the ACCD. Recall,
in Method 2 each missing value of a feature is assigned to the same discrete
level, which has been obtained by increasing the number of discrete levels for
that feature by one.

Table 2: Initial Quantization for Each Feature of the ACCD

Feature   1    2   3    4   5    6    7    8   9    10   11   12   13   14   15
Levels    [?]  3   [?]  5   [?]  15   10   2   [?]  2    2    2    3    3    2

Continuous (x): six features are marked (column positions illegible).
Missing (x): seven features are marked (column positions illegible).

To apply the BDRA to the ACCD it was necessary, as in the previous
application, to discretize the continuous valued features. However, in this case a
method of percentiles was chosen instead of using predefined values. For
example, to obtain binary values for a feature the threshold was found (using both
training and test data) which divided its sample size into two equal parts. As it
turns out, binary values were chosen for each continuous valued feature because
this produced the best classification performance for the ACCD. That is, the
probability of error increased with the number of discrete levels for the
continuous valued features. Further, if any of these continuous features also contained
missing values, then its quantization was also increased to be ternary valued.
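
The percentile discretization and the extra level for missing values can be sketched as follows. This is a minimal illustration, assuming missing entries are represented by None and that values equal to the median fall in the lower level; the function name and the sample values are hypothetical.

```python
from statistics import median


def binarize_with_missing(values):
    """Median-split a continuous feature; None marks a missing value.

    Returns (levels, n_levels): observed values map to level 1 or 2, and
    all missing values share one extra level, as described for Method 2.
    """
    observed = [v for v in values if v is not None]
    threshold = median(observed)
    has_missing = len(observed) < len(values)
    n_levels = 3 if has_missing else 2
    levels = []
    for v in values:
        if v is None:
            levels.append(3)
        else:
            # Assumption: ties at the median are placed in the lower level.
            levels.append(1 if v <= threshold else 2)
    return levels, n_levels


levels, n_levels = binarize_with_missing([0.2, 1.5, None, 3.1, 0.9, 2.4])
```

With ties or an odd number of observed samples the two halves are only approximately equal in size.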

The initial quantization for each of the fifteen features of the ACCD is given
in Table 2. Notice, also shown are labels for those features which were
continuous, and those which have missing values. Thus, based on Table 2 the initial
quantization complexity for this data is M = 3.1104 x 10^7.




Table 3: Performance of the BDRA and a Neural Network With the ACCD

                     BDRA    Neural Network
P(e)                 0.135   0.284
Standard deviation   0.020   0.058

In  applying  the  BDRA  to  the  ACCD  the  experimental  steps  shown  in  [66] 
were  followed.  That  is,  the  ACCD  was  randomly  partitioned  into  training  (518 
samples)  and  test  (172  samples)  sets.  Additionally,  all  results  shown  are  based  on 
an average of thirty independent trials. With this, the BDRA is compared to the
same neural network which was used in Chapter 5. However, the number of
nodes in the input layer has been increased to fifteen, and the numbers
of nodes in the two successive hidden layers have been increased to
thirty and fifteen, respectively.

In Table 3, performance results are shown for the BDRA and a neural network
(the neural network was initialized based on the minimum and maximum possible
values of both the training and test data). It can be seen in this table that the
BDRA reduces the probability of error by more than half as compared to the
neural network, and it also shows less of a standard deviation in the error. With
this, of the twenty three neural network algorithms tested in [66] using the ACCD
the BDRA is in the top three (best performance is shown for an evolutionary
type neural network with P(e) = 0.115). Also, the BDRA outperformed the
results appearing in [63], which show P(e) = 0.143. However, in each of these
cases the authors used only fourteen of the fifteen available features; thus, direct
comparisons are more difficult to interpret. Further, the CGLRT of Appendix A
was used in place of the CBT in the BDRA and it obtained a P(e) of 0.157, with
a standard deviation of 0.024.

Table 4: Final Quantization for Each Feature of the ACCD

Feature   1    2    3    4    5    6    7    8    9   10   11    12   13   14   15
Levels    [?]  [?]  [?]  [?]  [?]  1.7  1.6  [?]  2   1    1.3   1    1    1    1

With  these  results,  Table  4  shows  the  final  average  quantization  the  BDRA 
produced  over  the  thirty  independent  trials.  As  evidenced  by  the  number  of  ones 
appearing  in  Table  4  the  BDRA  prefers  to  completely  eliminate  more  than  half  of 
the  available  features.  In  fact,  it  obtains  a  final  average  quantization  complexity 
of  M  =  71.8,  which  is  a  significant  reduction  of  the  data.  Additionally,  notice 
that  the  BDRA  always  keeps  the  ninth  feature,  and  it  only  finds  one  of  the 
continuous  features  (feature  eleven)  to  be  useful.  This  latter  case  helps  to  explain 
why  the  BDRA’s  performance  diminishes  as  the  quantization  is  increased  for  the 
continuous  features. 

Before concluding this section an important note is made about applying the
BDRA to the ACCD, and this is relevant to any other potential applications
involving training data sets which have a high dimensionality and a small sample
size. In the initial application of the BDRA to the ACCD it was discovered that
the empirical probability of error (see formula (18) of Chapter 3) was computed
to be approximately 0.5, and this was subsequently reduced to around 0.34. In
this case, the BDRA was performing poorly with the ACCD. However, this poor
performance was only observed when the summation in formula (18) was taken
over all possible test observations (i.e., M = 3.1104 x 10^7 discrete symbols). Thus,
performance was dramatically improved for the BDRA by summing formula (18)
only over those test observations where a discrete symbol is represented in at
least one of the training data sets of each class (i.e., for $N_y = 1$ (extensions
to $N_y > 1$ are straightforward) summing only over those $y_i$ where $x_{k,i} > 0$ or
$x_{l,i} > 0$). This also required that formula (19) of Chapter 3 be redefined as
$f(y_i = 1 \mid \mathbf{x}_k, H_k;\, x_{k,i} > 0 \text{ or } x_{l,i} > 0)$, which essentially amounts to renormalizing
this formula so that it sums to unity over the range of i where $x_{k,i} > 0$ or
$x_{l,i} > 0$. Observe that with this modification a typical value of the empirical
probability of error before data reduction was computed to be 0.33, and after
applying the BDRA it was reduced to 0.14. In other words, even though the
initial quantization complexity of the data is more than thirty one million discrete
symbols, by employing the modification described above it is necessary for the
BDRA to only consider those discrete symbols relevant to the training data.
This helps to mitigate the curse of dimensionality, while lessening the number
of computations required by the BDRA.
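
The renormalization over observed symbols can be sketched as below. The sketch assumes that, for a single test observation, the predictive probability of symbol i under class k is proportional to $x_{k,i} + 1$ (the $N_y = 1$ case of formula (17), as developed in Appendix C); whether formula (19) has exactly this form is an assumption here, and the dictionary layout and function name are hypothetical.

```python
from fractions import Fraction


def restricted_predictive(x_k, x_l):
    """Renormalized single-observation predictive for class k.

    x_k, x_l: dicts mapping a symbol to its positive training count for
    classes k and l; symbols absent from both dicts are never enumerated.
    """
    support = sorted(set(x_k) | set(x_l))
    # Unnormalized weight x_{k,i} + 1 for each supported symbol; the common
    # factor 1 / (N_k + M) cancels in the renormalization.
    weights = {i: Fraction(x_k.get(i, 0) + 1) for i in support}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


x_k = {(1, 2): 3, (2, 2): 1}
x_l = {(2, 2): 2, (3, 1): 4}
probs = restricted_predictive(x_k, x_l)
```

Only three symbols are touched here, regardless of how large the full symbol space M is, which is the source of the computational saving described above.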

6.5  Summary 

In  this  chapter,  the  BDRA  was  applied  to  several  interesting  problems  in 
classification.  With  the  first  application,  it  was  shown  that  when  class-specific 
features are created for training data in an ad hoc manner (and with relatively
small sample sizes), the BDRA can be used to reduce the data for improved
classification  performance.  In  fact,  its  performance  was  shown  to  be  superior 
to  the  class-specific  classifier.  Also,  in  the  second  application  the  BDRA  was 
used  for  fusing  feature  information  from  sonar  echoes  which  were  produced  by 
independent  CW  and  FM  waveforms.  In  this  case,  the  BDRA  was  shown  to 
be  more  effective,  at  correct  target  recognition,  than  a  Chi-square  statistic  test 
and  an  OR  detector.  The  final  application  involved  applying  the  BDRA  to  the 
Australian  credit  card  data,  and  in  this  case  the  BDRA  obtained  a  probability 
of  error  that  was  less  than  half  that  obtained  by  the  same  neural  network  used 
in  previous  chapters. 




Appendix  A 


Results  Using  Empirically  Generated  Data 


Simulated  performance  results  are  shown  below  for  the  CBT  and  the  SGLRT 
of,  respectively,  formulas  (4)  and  (10)  of  Chapter  2,  and  the  CGLRT  given  by 


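$$
\frac{f(\mathbf{x}_k,\mathbf{x}_l,\mathbf{y}\mid \hat{\mathbf{p}}_k,\hat{\mathbf{p}}_l,H_k)}
     {f(\mathbf{x}_k,\mathbf{x}_l,\mathbf{y}\mid \hat{\mathbf{p}}_k,\hat{\mathbf{p}}_l,H_l)}
= \prod_{i=1}^{M}
\frac{\hat{p}_{k,i;H_k}^{\,x_{k,i}+y_i}\;\hat{p}_{l,i;H_k}^{\,x_{l,i}}}
     {\hat{p}_{k,i;H_l}^{\,x_{k,i}}\;\hat{p}_{l,i;H_l}^{\,x_{l,i}+y_i}}
\;\underset{H_l}{\overset{H_k}{\gtrless}}\;\tau
\tag{41}
$$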


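where symbol probabilities for the ith symbol are obtained from ML estimates (see [12])
based on the training and test data, or,

$$
\hat{p}_{k,i;H_k} = \frac{x_{k,i}+y_i}{N_k+N_y},\quad
\hat{p}_{l,i;H_l} = \frac{x_{l,i}+y_i}{N_l+N_y},\quad
\hat{p}_{k,i;H_l} = \frac{x_{k,i}}{N_k},\quad\text{and}\quad
\hat{p}_{l,i;H_k} = \frac{x_{l,i}}{N_l}.
\tag{42}
$$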
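
As an illustration of formulas (41) and (42), the following sketch evaluates the logarithm of the CGLRT statistic from count vectors. The function names and the convention that a zero count contributes nothing to the log-likelihood (0 log 0 = 0) are assumptions of this example, and nonempty training sets are assumed.

```python
from math import log


def xlogy(count, prob):
    """count * log(prob) with the convention 0 * log(0) = 0."""
    if count == 0:
        return 0.0
    return count * log(prob)


def cglrt_log_ratio(x_k, x_l, y):
    """Log of the CGLRT statistic of formula (41) with ML plug-in estimates."""
    n_k, n_l, n_y = sum(x_k), sum(x_l), sum(y)
    log_num = 0.0  # log-likelihood with y pooled into class k (H_k)
    log_den = 0.0  # log-likelihood with y pooled into class l (H_l)
    for xk, xl, yi in zip(x_k, x_l, y):
        log_num += xlogy(xk + yi, (xk + yi) / (n_k + n_y)) + xlogy(xl, xl / n_l)
        log_den += xlogy(xk, xk / n_k) + xlogy(xl + yi, (xl + yi) / (n_l + n_y))
    return log_num - log_den


x_k = [5, 3, 2]
x_l = [2, 3, 5]
y = [2, 0, 0]
ratio = cglrt_log_ratio(x_k, x_l, y)
```

A positive value favors $H_k$; comparing it with the logarithm of the threshold gives the decision.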

Results  for  the  CBT,  SGLRT,  and  the  CGLRT  are  displayed  using  operating 
characteristic  (OC)  curves  (i.e.,  the  probability  of  correct  recognition  (Pd)  versus 
the  probability  of  false  recognition  (Pfa)),  and  the  following  items  list  specific 
information  about  the  simulations: 

•  Each  OC  result  is  based  on  10,000  independent  iterations. 

•  The  number  of  quantizing  symbols  is  M  =  8. 

•  At  each  iteration  the  symbol  probabilities  are  generated  according  to  a 
multivariate  uniform  distribution,  and  using  these  the  training  and  test 
data  are  generated. 

•  To  generate  the  training  and  test  data  the  symbol  probabilities  above  are 
modified  in  the  Mth  symbol  probability. 

•  $p_{\text{class }1,M}(\text{training}) = 0.005$ and $p_{\text{class }1,M}(\text{test}) = 0.125$.

•  The training data sizes are $N_{\text{class }1} = 50$ and $N_{\text{class }2} = 250$.

•  Two  test  observation  sizes,  Ny,  of  2  and  25  are  shown. 

•  The optimal curves in the OC are found by using the actual symbol
probabilities of the test data.

Notice,  the  Mth  symbol  probability  is  modified  to  reflect  a  possible  mismatch 
between  the  training  data  of  class  1  and  the  test  data,  and  indeed  it  is  here  that 
the  main  differences  between  the  tests  are  to  be  found. 
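
One iteration of the data generation described in the list above might look like the following sketch. The uniform draw over the simplex is obtained by normalizing unit-rate exponential variates; the rule used to rescale the remaining probabilities after fixing the Mth one is an assumption, since the text only states that the Mth symbol probability is modified. All function names are hypothetical.

```python
import random
from collections import Counter


def uniform_dirichlet(m, rng):
    """Draw symbol probabilities uniformly over the m-dimensional simplex."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def set_last_symbol(probs, p_last):
    """Force the Mth probability to p_last and rescale the rest to sum to one.

    The rescaling rule is an assumption made for this sketch.
    """
    scale = (1.0 - p_last) / sum(probs[:-1])
    return [p * scale for p in probs[:-1]] + [p_last]


def draw_counts(probs, n, rng):
    """Draw n samples and return the per-symbol counts."""
    tally = Counter(rng.choices(range(len(probs)), weights=probs, k=n))
    return [tally.get(i, 0) for i in range(len(probs))]


rng = random.Random(0)
M = 8
base = uniform_dirichlet(M, rng)
p_train = set_last_symbol(base, 0.005)   # class 1 training distribution
p_test = set_last_symbol(base, 0.125)    # class 1 test distribution
x_1 = draw_counts(p_train, 50, rng)      # class 1 training counts
x_2 = draw_counts(uniform_dirichlet(M, rng), 250, rng)  # class 2 training counts
y = draw_counts(p_test, 25, rng)         # test counts with N_y = 25
```

Repeating this over many iterations and thresholding each test statistic would trace out an OC curve.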


Figure  25:  Simulated  performance  comparison  of  the  CBT,  CGLRT,  and  the 
SGLRT  where  Ny  =  2. 

In Figure 25 (above) and Figure 26 (below) two OC plots appear for the case of
$p_{\text{class }1,M}(\text{training}) = 0.005$ and $p_{\text{class }1,M}(\text{test}) = 0.125$. Also, Figure 25 represents the
situation in which $N_y = 2$, while that of Figure 26 represents $N_y = 25$. Notice,
in both of these figures a relatively severe mismatch between the testing and
training distributions exists and, as a result, performance for all tests falls below
optimal. However, because the CBT and the CGLRT use test observations to help
infer the true symbol probabilities, both of these tests outperform the SGLRT.
Observe, this is particularly true in Figure 26 ($N_y = 25$), where the SGLRT's
low estimate of $p_{\text{class }1,M}(\text{training})$ is causing it to make classification errors as the
frequency of symbol M increases in the test data. Also, it can be seen that
the SGLRT curve does not appear concave, and this is due to poor estimates of the
symbol probabilities. Further, both of these figures demonstrate the similarity
between the CBT and the CGLRT, and their performance improvement with test
observations as predicted by Figure 4 of Chapter 2.


Figure  26:  Simulated  performance  comparison  of  the  CBT,  CGLRT,  and  the 
SGLRT  where  Ny  =  25. 

In general, the trend in performance appearing in Figures 25 and 26 continues
as the difference between $p_{\text{class }1,M}(\text{training})$ and $p_{\text{class }1,M}(\text{test})$ increases. That is,
performance of the SGLRT falls off more rapidly than either the CBT or the
CGLRT, and this is because the combined tests are able to extract distributional
information from the test observations. But, it is also noted that if the
mismatched symbol appears substantially less often in the test data than predicted
by the training data (e.g., $p_{\text{class }1,M}(\text{training}) = 0.125$ and $p_{\text{class }1,M}(\text{test}) = 0.005$),
the performance loss in all three tests is not as severe.

In Figure 27 below empirical results are given for comparing performance of
the CBT, formula (14) in Chapter 2, to that of the KST, formula (15) of the
same chapter. In this figure, two curves appear for each test where
$N_{\text{class }1} = N_{\text{class }2} = 16$, and $N_{\text{class }1} = N_{\text{class }2} = 50$. Also, there are a total of M = 8
discrete symbols, and the results are based on 10,000 independent trials where,
at each trial, the true symbol probabilities are Dirichlet distributed. With this,
the axis labels are the probability of declaring statistical similarity when true
(Pd) versus the probability of declaring statistical similarity when false (Pfa).
It is apparent in this figure, and consistent with Figure 7 of Chapter 2,
that the CBT overall performs better than the KST. Also, it can be seen that,
as expected, the performance of both tests becomes similar as the sample size
increases.


Figure  27:  Simulated  performance  comparison  of  the  CBT  and  the  KST. 




Appendix  B 


Development  of  the  Combined  Bayes  Test 


The  CBT  of  formula  (4),  Chapter  2,  is  based  on  solving  an  integral  expression 
of  the  type  given  by  (this  integral  is  solved  for  class  k  true,  and  for  class  l  recall 
from  formula  (1)  that  k  and  l  are  exchangeable) 


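$$
f(\mathbf{x}_k,\mathbf{x}_l,\mathbf{y}\mid H_k) = \int_{\mathbf{p}_k}\int_{\mathbf{p}_l}
f(\mathbf{x}_k,\mathbf{x}_l,\mathbf{y}\mid \mathbf{p}_k,\mathbf{p}_l,H_k)\,
f(\mathbf{p}_k)\,f(\mathbf{p}_l)\,d\mathbf{p}_k\,d\mathbf{p}_l
\tag{43}
$$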

with  each  integration  taken  over  the  M  dimensional  unit  simplex,  [51]. 

Now,  because  of  independence  assumptions  (see  formula  (1)  of  Chapter  2), 
formula  (43)  can  be  rewritten  as 


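$$
f(\mathbf{x}_k,\mathbf{x}_l,\mathbf{y}\mid H_k) =
\int_{\mathbf{p}_k} f(\mathbf{x}_k,\mathbf{y}\mid \mathbf{p}_k,H_k)\,f(\mathbf{p}_k)\,d\mathbf{p}_k
\int_{\mathbf{p}_l} f(\mathbf{x}_l\mid \mathbf{p}_l,H_k)\,f(\mathbf{p}_l)\,d\mathbf{p}_l
\tag{44}
$$

where (notice in formula (46) $\mathbf{x}_l$ is independent of $H_k$)

$$
f(\mathbf{x}_k,\mathbf{y}\mid \mathbf{p}_k,H_k) = N_k!\,N_y!\prod_{i=1}^{M}\frac{p_{k,i}^{\,x_{k,i}+y_i}}{x_{k,i}!\,y_i!}
\tag{45}
$$

$$
f(\mathbf{x}_l\mid \mathbf{p}_l,H_k) = f(\mathbf{x}_l\mid \mathbf{p}_l) = N_l!\prod_{i=1}^{M}\frac{p_{l,i}^{\,x_{l,i}}}{x_{l,i}!}
\tag{46}
$$

and (note, for $f(\mathbf{p}_l)$ simply substitute l for k in $f(\mathbf{p}_k)$)

$$
f(\mathbf{p}_k) = (M-1)!\;I_{\{\sum_{i=1}^{M} p_{k,i} = 1\}}.
\tag{47}
$$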

Observe that $f(\mathbf{p}_k)$ is uniformly distributed and represented by a Dirichlet
distribution, [29, 33], which is also called the multivariate beta density, [45, 51].
In general, the form of this distribution is nonuniform (see formula (3) of Chapter
2) except, as in this case, when its parameters are selected to be unity, [8]. With
this, the marginal probability formulas of the uniform Dirichlet are given by
(these formulas are used below),

$$
f(p_{k,i}\mid p_{k,i+1},p_{k,i+2},\ldots,p_{k,M}) =
\begin{cases}
(M-1)\,(1-p_{k,i})^{M-2}\,I_{\{(0,1)\}}, & i = M \\[6pt]
\dfrac{(i-1)\left(1-\sum_{j=i}^{M}p_{k,j}\right)^{i-2}}{\left(1-\sum_{j=i+1}^{M}p_{k,j}\right)^{i-1}}\,
I_{\{(0,\,1-\sum_{j=i+1}^{M}p_{k,j})\}}, & 1 < i < M \\[10pt]
I_{\{\sum_{j=1}^{M}p_{k,j}=1\}}, & i = 1
\end{cases}
\tag{48}
$$


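Now, the first integral in formula (44) is worked on next, and after factoring
$f(\mathbf{p}_k)$ using formula (48) it is given by

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_{\mathbf{p}_k} f(\mathbf{x}_k,\mathbf{y}\mid \mathbf{p}_k,H_k)\,
f(p_{k,1}\mid p_{k,2},\ldots,p_{k,M})\,
f(p_{k,2}\mid p_{k,3},\ldots,p_{k,M})\cdots f(p_{k,M})\,d\mathbf{p}_k.
\tag{49}
$$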

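Continuing, formulas (45) and (48) are used in the first two distributions after
the integral sign of formula (49), and then the integral over $\mathbf{p}_k$ is broken down
into M separate integrals producing

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_0^1\!\cdots\!\int_0^1
\frac{N_k!\,N_y!}{\prod_{i=1}^{M}x_{k,i}!\,y_i!}
\left[\prod_{i=1}^{M}p_{k,i}^{\,x_{k,i}+y_i}\right]
I_{\{\sum_{i=1}^{M}p_{k,i}=1\}}\,
f(p_{k,2}\mid p_{k,3},\ldots,p_{k,M})\cdots f(p_{k,M})\;
dp_{k,1}\cdots dp_{k,M}.
\tag{50}
$$

This is integrated with respect to $p_{k,1}$ yielding

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_0^1\!\cdots\!\int_0^{1-\sum_{i=3}^{M}p_{k,i}}
\frac{N_k!\,N_y!}{\prod_{i=1}^{M}x_{k,i}!\,y_i!}
\left[\prod_{i=2}^{M}p_{k,i}^{\,x_{k,i}+y_i}\right]
\left(1-\sum_{i=2}^{M}p_{k,i}\right)^{x_{k,1}+y_1}
f(p_{k,2}\mid p_{k,3},\ldots,p_{k,M})\cdots f(p_{k,M})\;
dp_{k,2}\cdots dp_{k,M}.
\tag{51}
$$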

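From here, formula (48) is used in $f(p_{k,2}\mid p_{k,3},\ldots,p_{k,M})$ of formula (51) so
that

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_0^1\!\cdots\!\int_0^{1-\sum_{i=3}^{M}p_{k,i}}
\frac{N_k!\,N_y!}{\prod_{i=1}^{M}x_{k,i}!\,y_i!}
\left[\prod_{i=2}^{M}p_{k,i}^{\,x_{k,i}+y_i}\right]
\frac{\left(1-\sum_{i=2}^{M}p_{k,i}\right)^{x_{k,1}+y_1}}{1-\sum_{i=3}^{M}p_{k,i}}\,
f(p_{k,3}\mid p_{k,4},\ldots,p_{k,M})\cdots f(p_{k,M})\;
dp_{k,2}\cdots dp_{k,M}.
\tag{52}
$$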


Before integrating formula (52) with respect to $p_{k,2}$ an integral expression is
required that is given by

$$
\int_0^{A}(A-w)^{a}\,w^{b}\,dw = A^{a+b+1}\,\frac{a!\,b!}{(a+b+1)!}
\tag{53}
$$


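and using this to do the integration results in

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_0^1\!\cdots\!\int_0^{1-\sum_{i=4}^{M}p_{k,i}}
\frac{N_k!\,N_y!}{\prod_{i=1}^{M}x_{k,i}!\,y_i!}
\left[\prod_{i=3}^{M}p_{k,i}^{\,x_{k,i}+y_i}\right]
\left(1-\sum_{i=3}^{M}p_{k,i}\right)^{x_{k,1}+y_1+x_{k,2}+y_2}
\frac{(x_{k,1}+y_1)!\,(x_{k,2}+y_2)!}{(x_{k,1}+y_1+x_{k,2}+y_2+1)!}\,
f(p_{k,3}\mid p_{k,4},\ldots,p_{k,M})\cdots f(p_{k,M})\;
dp_{k,3}\cdots dp_{k,M}.
\tag{54}
$$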

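Now, the procedure employed in formulas (51) through (54) is repeated with
respect to $p_{k,3}$ producing

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_0^1\!\cdots\!\int_0^{1-\sum_{i=5}^{M}p_{k,i}}
\frac{2\,N_k!\,N_y!}{\prod_{i=1}^{M}x_{k,i}!\,y_i!}
\left[\prod_{i=4}^{M}p_{k,i}^{\,x_{k,i}+y_i}\right]
\left(1-\sum_{i=4}^{M}p_{k,i}\right)^{x_{k,1}+y_1+x_{k,2}+y_2+x_{k,3}+y_3}
\frac{(x_{k,1}+y_1)!\,(x_{k,2}+y_2)!\,(x_{k,3}+y_3)!}{(x_{k,1}+y_1+x_{k,2}+y_2+x_{k,3}+y_3+2)!}\,
f(p_{k,4}\mid p_{k,5},\ldots,p_{k,M})\cdots f(p_{k,M})\;
dp_{k,4}\cdots dp_{k,M}
\tag{55}
$$

and likewise using the same procedure with respect to $p_{k,4}$ yields

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) = \int_0^1\!\cdots\!\int_0^{1-\sum_{i=6}^{M}p_{k,i}}
\frac{3\times 2\,N_k!\,N_y!}{\prod_{i=1}^{M}x_{k,i}!\,y_i!}
\left[\prod_{i=5}^{M}p_{k,i}^{\,x_{k,i}+y_i}\right]
\left(1-\sum_{i=5}^{M}p_{k,i}\right)^{x_{k,1}+y_1+x_{k,2}+y_2+x_{k,3}+y_3+x_{k,4}+y_4}
\frac{(x_{k,1}+y_1)!\,(x_{k,2}+y_2)!\,(x_{k,3}+y_3)!\,(x_{k,4}+y_4)!}{(x_{k,1}+y_1+x_{k,2}+y_2+x_{k,3}+y_3+x_{k,4}+y_4+3)!}\,
f(p_{k,5}\mid p_{k,6},\ldots,p_{k,M})\cdots f(p_{k,M})\;
dp_{k,5}\cdots dp_{k,M}.
\tag{56}
$$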

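Continuing in this way a final result is obtained, which is given by

$$
\begin{aligned}
f(\mathbf{x}_k,\mathbf{y}\mid H_k)
&= \frac{(M-1)!\,N_k!\,N_y!\left[\prod_{i=1}^{M}(x_{k,i}+y_i)!\right]}
        {\left[\prod_{i=1}^{M}x_{k,i}!\,y_i!\right]\left[\left(\sum_{i=1}^{M}(x_{k,i}+y_i)+M-1\right)!\right]} \\
&= \frac{(M-1)!\,N_k!\,N_y!}{(N_k+N_y+M-1)!}\prod_{i=1}^{M}\frac{(x_{k,i}+y_i)!}{x_{k,i}!\,y_i!}.
\end{aligned}
\tag{57}
$$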


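Returning now to formula (44), the second integral expression is evaluated by
integrating as before, and this results in (where $f(\mathbf{x}_l\mid H_k) = f(\mathbf{x}_l)$, see formula
(46))

$$
f(\mathbf{x}_l)
= \frac{(M-1)!\,N_l!\left[\prod_{i=1}^{M}x_{l,i}!\right]}
       {\left[\prod_{i=1}^{M}x_{l,i}!\right]\left[\left(\sum_{i=1}^{M}x_{l,i}+M-1\right)!\right]}
= \frac{(M-1)!\,N_l!}{(N_l+M-1)!}.
\tag{58}
$$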


A few notes about the formulas in (57) and (58) are necessary. First, the
result in formula (58) can be found by simply eliminating any test observations
in formula (57). Also, notice that without test observations $f(\mathbf{x}_l)$ in formula (58)
is uniformly distributed. In other words, the likelihood of all training data being
the same symbol is equal to them being different symbols. However, with test
observations, the same procedure produces a nonuniform result for $f(\mathbf{x}_k,\mathbf{y}\mid H_k)$
as shown in formula (57). That is, formula (57) is distributed according to the
number of occurrences of training and test data.

Using formulas (57) and (58) the integral expression in formula (44) can then
be solved, or

$$
f(\mathbf{x}_k,\mathbf{x}_l,\mathbf{y}\mid H_k) =
\frac{\left[(M-1)!\right]^{2}N_k!\,N_l!\,N_y!}{(N_k+N_y+M-1)!\,(N_l+M-1)!}
\prod_{i=1}^{M}\frac{(x_{k,i}+y_i)!}{x_{k,i}!\,y_i!}
\tag{59}
$$

where the equivalent formula for class l (i.e., $H_l$) is obtained by a substitution
of l for k. Finally, the combined Bayes test of formula (4) in Chapter 2 then
becomes the ratio of formula (59) to its equivalent formula under class l.
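
The closed form of formula (59) lends itself to a direct numerical sketch. The following example evaluates the logarithm of the combined Bayes test statistic with log-gamma functions to avoid overflow in the factorials; the function names are hypothetical and the count vectors are illustrative.

```python
from math import lgamma, log


def log_factorial(n):
    """log(n!) computed through the log-gamma function."""
    return lgamma(n + 1)


def log_joint(x_a, x_b, y, m):
    """log f(x_a, x_b, y | y belongs with x_a), following formula (59)."""
    n_a, n_b, n_y = sum(x_a), sum(x_b), sum(y)
    value = (2 * log_factorial(m - 1) + log_factorial(n_a)
             + log_factorial(n_b) + log_factorial(n_y)
             - log_factorial(n_a + n_y + m - 1)
             - log_factorial(n_b + m - 1))
    for xa, yi in zip(x_a, y):
        value += log_factorial(xa + yi) - log_factorial(xa) - log_factorial(yi)
    return value


def cbt_log_ratio(x_k, x_l, y):
    """Log of the combined Bayes test statistic; positive values favor H_k."""
    m = len(x_k)
    return log_joint(x_k, x_l, y, m) - log_joint(x_l, x_k, y, m)


x_k = [5, 3, 2]
x_l = [2, 3, 5]
y = [2, 0, 0]
ratio = cbt_log_ratio(x_k, x_l, y)
```

With equal training sizes the factorial prefactors cancel, so for this input the statistic reduces to the ratio of the product terms, 21/6 = 3.5.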




Appendix  C 


Development of $f(\mathbf{y}\mid\mathbf{x}_k, H_k)$ from Probabilistic
Considerations


In Chapter 3, it was previously mentioned that the distribution $f(\mathbf{y}\mid\mathbf{x}_k, H_k)$
shown as formula (17) was required for the error probability formula of (18).
Here, this distribution is developed from probabilistic considerations, and to do
this, a random vector is first defined which is the sum of the unknown test vector
and the kth training data set, or, $\mathbf{s} = \mathbf{y} + \mathbf{x}_k$.

Now, under a Dirichlet prior for the symbol probabilities given by

$$
f(\mathbf{p}_k) = (M-1)!\;I_{\{\sum_{i=1}^{M}p_{k,i}=1\}}
\tag{60}
$$

the distribution of $\mathbf{s}$ can be determined. However, before applying the Dirichlet,
the distribution of $\mathbf{s}$ conditioned on $\mathbf{p}_k$ and $H_k$ is multinomial and it appears as

$$
f(\mathbf{s}\mid\mathbf{p}_k,H_k) = (N_k+N_y)!\prod_{i=1}^{M}\frac{p_{k,i}^{\,s_i}}{s_i!}.
\tag{61}
$$

Then, after applying the Dirichlet the distribution of $\mathbf{s}$ is given by

$$
f(\mathbf{s}\mid H_k) = \frac{1}{\dbinom{N_k+N_y+M-1}{N_k+N_y}}.
\tag{62}
$$

Notice, as in formula (8) of Chapter 2, the notation $\binom{N_k+N_y+M-1}{N_k+N_y}$ means
the number of ways that $N_k+N_y$ samples can be arranged amongst M discrete
symbols. Also, the result in formula (62) means that $\mathbf{s}$ is, with equal probability,
any valid value.

Continuing,  conditioned  on  s,  any  way  of  choosing  the  Ny  test  observations 
is  also  equiprobable,  and  this  has  the  distribution 


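$$
f(\mathbf{y}\mid\mathbf{s},H_k) = \frac{\prod_{i=1}^{M}\dbinom{s_i}{y_i}}{\dbinom{N_k+N_y}{N_y}}.
\tag{63}
$$

It is also apparent from the definition of $\mathbf{s}$ that

$$
f(\mathbf{x}_k\mid\mathbf{y},\mathbf{s},H_k) = I_{\{\mathbf{s}=\mathbf{y}+\mathbf{x}_k\}}.
\tag{64}
$$

The distributions of formulas (62), (63), and (64) result in

$$
f(\mathbf{x}_k,\mathbf{y},\mathbf{s}\mid H_k) =
\frac{I_{\{\mathbf{s}=\mathbf{y}+\mathbf{x}_k\}}\prod_{i=1}^{M}\dbinom{s_i}{y_i}}
     {\dbinom{N_k+N_y}{N_y}\dbinom{N_k+N_y+M-1}{N_k+N_y}}.
\tag{65}
$$

From which we obtain

$$
f(\mathbf{x}_k,\mathbf{y}\mid H_k) =
\frac{\prod_{i=1}^{M}\dbinom{x_{k,i}+y_i}{y_i}}
     {\dbinom{N_k+N_y}{N_y}\dbinom{N_k+N_y+M-1}{N_k+N_y}}.
\tag{66}
$$

Now, noting that $f(\mathbf{x}_k) = 1\big/\binom{N_k+M-1}{N_k}$ (see formula (58) of Appendix
B), the desired distribution is found, or,

$$
\begin{aligned}
f(\mathbf{y}\mid\mathbf{x}_k,H_k) &= \frac{f(\mathbf{x}_k,\mathbf{y}\mid H_k)}{f(\mathbf{x}_k)} \\
&= \frac{\prod_{i=1}^{M}\dbinom{x_{k,i}+y_i}{y_i}\dbinom{N_k+M-1}{N_k}}
        {\dbinom{N_k+N_y}{N_y}\dbinom{N_k+N_y+M-1}{N_k+N_y}} \\
&= \frac{\prod_{i=1}^{M}\dbinom{x_{k,i}+y_i}{y_i}}{\dbinom{N_k+N_y+M-1}{N_y}} \\
&= \frac{(N_k+M-1)!\,N_y!}{(N_k+N_y+M-1)!}\prod_{i=1}^{M}\frac{(x_{k,i}+y_i)!}{x_{k,i}!\,y_i!}
\end{aligned}
\tag{67}
$$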

which  is  the  same  as  formula  (17)  of  Chapter  3. 
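
As a numerical check on this result, the sketch below evaluates the predictive distribution in an algebraically equivalent binomial-coefficient form, the product of the binomial coefficients of $x_{k,i}+y_i$ over $y_i$ divided by the binomial coefficient of $N_k+N_y+M-1$ over $N_y$, and confirms that it sums to one over all test count vectors. The function names and the small example counts are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def predictive(y, x_k):
    """Exact f(y | x_k, H_k) in binomial-coefficient form."""
    m, n_k, n_y = len(x_k), sum(x_k), sum(y)
    numerator = 1
    for xi, yi in zip(x_k, y):
        numerator *= comb(xi + yi, yi)
    return Fraction(numerator, comb(n_k + n_y + m - 1, n_y))


def all_test_vectors(m, n_y):
    """Enumerate every count vector y of length m with sum n_y."""
    return [y for y in product(range(n_y + 1), repeat=m) if sum(y) == n_y]


x_k = [3, 0, 1]
total = sum(predictive(y, x_k) for y in all_test_vectors(3, 2))
single = predictive((1, 0, 0), x_k)
```

For a single test observation the expression collapses to $(x_{k,i}+1)/(N_k+M)$, which for the first symbol of this example is 4/7.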




Appendix  D 


Mean  and  Variance  of  the  Probability  of  Error 
for  Dirichlet  Distributed  Symbol  Probabilities 


Under the assumption of an optimal test, and that there are two classes labeled
k and l, the test chooses k if y = i and $p_{k,i} > p_{l,i}$. Thus, for the probability of
error we have (see [34])

$$
\begin{aligned}
P(e\mid k) &= \sum_{i=1}^{M}P(e,\,y=i\mid k) && (68)\\
&= M\,P(e,\,y=1\mid k) && (69)\\
&= M\,\Pr(p_{l,1}>p_{k,1},\,y=1\mid k) && (70)
\end{aligned}
$$

where formulas (68) and (69) result from total probability and symmetry in the
probabilities, and formula (70) is based on the definition of an error under k.
But, formula (70) is also equivalent to

$$
\begin{aligned}
&= M\int_0^1\Pr(p_{l,1}>p_{k,1},\,y=1\mid p_{k,1},k)\,f(p_{k,1})\,dp_{k,1} && (71)\\
&= M\int_0^1\Pr(p_{l,1}>p_{k,1}\mid y=1,\,p_{k,1},k)\,\Pr(y=1\mid p_{k,1},k)\,f(p_{k,1})\,dp_{k,1} && (72)
\end{aligned}
$$

where  formulas  (71)  and  (72)  result  from  another  application  of  total  probability, 
and  conditional  independence. 

Now, using the definitions (see formula (48) of Appendix B),

$$
\begin{aligned}
f(p_{k,1}) &= (M-1)\,(1-p_{k,1})^{M-2} && (73)\\
\Pr(y=1\mid p_{k,1},k) &= p_{k,1} && (74)\\
\Pr(p_{l,1}>p_{k,1}\mid y=1,\,p_{k,1},k) &= \int_{p_{k,1}}^{1}(M-1)\,(1-p_{l,1})^{M-2}\,dp_{l,1}
 = (1-p_{k,1})^{M-1} && (75)
\end{aligned}
$$

formula (72) then becomes

$$
\begin{aligned}
P(e\mid k) &= M\int_0^1(1-p_{k,1})^{M-1}\,p_{k,1}\,(M-1)\,(1-p_{k,1})^{M-2}\,dp_{k,1} && (76)\\
&= M(M-1)\int_0^1(1-p_{k,1})^{2M-3}\,p_{k,1}\,dp_{k,1} && (77)\\
&= \frac{M}{2\,(2M-1)}. && (78)
\end{aligned}
$$

Note, formula (53) of Appendix B was used in obtaining formula (78), and it is
clear in this result that under a uniform Dirichlet distribution, as M approaches
infinity, the quantity $P(e\mid k)$ approaches 1/4.
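
The two-class average can be checked by simulation. The sketch below draws both sets of symbol probabilities uniformly over the simplex and averages the error of the optimal test under class k; it compares the estimate with M / (2(2M - 1)), the value obtained by evaluating the integral of formula (77) in closed form. The function names, the seed, and the number of trials are arbitrary choices made for this illustration.

```python
import random


def uniform_dirichlet(m, rng):
    """Draw symbol probabilities uniformly over the m-dimensional simplex."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_error_two_class(m, trials, seed=0):
    """Monte Carlo estimate of the average P(e|k) for two classes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        p_k = uniform_dirichlet(m, rng)
        p_l = uniform_dirichlet(m, rng)
        # Under class k, an error occurs for symbol i when p_{l,i} > p_{k,i}.
        total += sum(pk for pk, pl in zip(p_k, p_l) if pl > pk)
    return total / trials


M = 8
estimate = mean_error_two_class(M, trials=20000)
closed_form = M / (2 * (2 * M - 1))
```

For M = 8 the closed form is 8/30, already close to the limiting value of 1/4.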

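Using the results above, the variance of $P(e\mid k)$ can be determined by first
finding

$$
\begin{aligned}
E\!\left[P(e\mid k)^{2}\right]
&= M\int_0^1\left[(1-p_{k,1})^{M-1}p_{k,1}\right]^{2}(M-1)\,(1-p_{k,1})^{M-2}\,dp_{k,1} \\
&\quad + M(M-1)\int_0^1(1-p_{j,1})^{M-1}\,p_{j,1}\,(M-1)\,(1-p_{j,1})^{M-2}\,dp_{j,1} \\
&\qquad\times\int_0^1(1-p_{k,1})^{M-1}\,p_{k,1}\,(M-1)\,(1-p_{k,1})^{M-2}\,dp_{k,1} && (79)\\
&= \frac{2M}{3\,(3M-1)(3M-2)} + \frac{M(M-1)}{4\,(2M-1)^{2}}, && (80)
\end{aligned}
$$

and after subtracting from formula (80) the formula of (78) squared, produces
the result

$$
\operatorname{var}P(e\mid k) = \frac{2M}{3\,(3M-1)(3M-2)} - \frac{M}{4\,(2M-1)^{2}}.
\tag{81}
$$

With this, it can be seen that as M approaches infinity the variance of $P(e\mid k)$
approaches a limit of zero.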
The result in formula (78) can be straightforwardly extended to three classes
(i.e., C = 3) by redefining formula (70) as

$$
P(e\mid k;\,C=3) = M\left[\Pr(y=1\mid k) - \Pr(p_{j,1}<p_{k,1},\;p_{l,1}<p_{k,1},\;y=1\mid k)\right]
\tag{82}
$$

and using formulas (73), (74), and

$$
\Pr(p_{l,1}<p_{k,1}\mid y=1,\,p_{k,1},k) = \int_0^{p_{k,1}}(M-1)\,(1-p_{l,1})^{M-2}\,dp_{l,1}
= 1-(1-p_{k,1})^{M-1}
\tag{83}
$$

formula (82) becomes

$$
\begin{aligned}
P(e\mid k;\,C=3) &= M\int_0^1\left[1-\left(1-(1-p_{k,1})^{M-1}\right)^{2}\right]p_{k,1}\,(M-1)\,(1-p_{k,1})^{M-2}\,dp_{k,1} && (84)\\
&= \frac{M}{2M-1} - \frac{1}{3}\left(\frac{M}{3M-2}\right). && (85)
\end{aligned}
$$

Observe that in the limit as M approaches infinity $P(e\mid k;\,C=3)$ approaches
7/18. Thus, as compared to formula (78), the limit of the average probability
of error increases with the number of classes under Dirichlet distributed symbol
probabilities.
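
The three-class result can be verified with exact rational arithmetic. The sketch below expands the integrand term by term, assuming the probability that both competing classes fall below $p_{k,1}$ is the square of the single-class probability $1-(1-p_{k,1})^{M-1}$, and compares the outcome with M/(2M-1) - M/(3(3M-2)). The function names are hypothetical.

```python
from fractions import Fraction


def beta_moment(a):
    """Exact integral of p * (1 - p)**a over [0, 1]."""
    return Fraction(1, (a + 1) * (a + 2))


def error_three_class(m):
    """Average P(e|k; C=3) obtained by expanding the integrand term by term."""
    # 1 - (1 - (1-p)^(M-1))^2 = 2 (1-p)^(M-1) - (1-p)^(2M-2); multiplying by
    # p (M-1) (1-p)^(M-2) gives exponents 2M-3 and 3M-4 on (1-p).
    return m * (m - 1) * (2 * beta_moment(2 * m - 3) - beta_moment(3 * m - 4))


def closed_form(m):
    """M/(2M-1) - M/(3(3M-2)) as an exact fraction."""
    return Fraction(m, 2 * m - 1) - Fraction(m, 3 * (3 * m - 2))


values_agree = all(error_three_class(m) == closed_form(m) for m in range(2, 40))
near_limit = float(closed_form(10 ** 6))
```

For very large M the value settles at 7/18, above the two-class limit of 1/4.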




Bibliography 


[1]  K.  Abend  and  T.  J.  Harley,  Jr.,  “Comments  on  ’‘The  Mean  Accuracy  of  Sta¬ 
tistical  Pattern  Recognizers’,”  IEEE  Transactions  on  Information  Theory , 
vol.  15,  May  1969,  pp.  420-421. 

[2]  P.  M.  Baggenstoss,  “Class-Specific  Feature  Sets  in  Classification,”  To  appear 
in  a  future  issue  of  the  IEEE  Transactions  on  Signal  Processing. 

[3]  B.  Baygun  and  A.  0.  Hero  III,  “Optimal  Simultaneous  Detection  and  Esti¬ 
mation  Under  a  False  Alarm  Constraint,”  IEEE  Transactions  on  Informa¬ 
tion  Theory ,  vol.  41,  no.  3,  May  1995,  pp.  688-703. 

[4]  Y.  Bar-Shalom  and  X.  Li,  Multitarget-Multisensor  Tracking:  Principles  and 
Techniques ,  Course  Notes,  University  of  Connecticut,  1995. 

[5]  J.  M.  Bernardo  and  A.  F.  M.  Smith,  Bayesian  Theory ,  Wiley,  New  York, 
1994. 

[6]  T.  G.  Birdsall  and  J.  0.  Gobien,  “Sufficient  Statistics  and  Reproducing  Den¬ 
sities  in  Simultaneous  Sequential  Detection  and  Estimation,”  IEEE  Trans¬ 
actions  on  Information  Theory ,  vol.  19,  no.  6,  November  1973,  pp.  760-768. 

[7]  C.  M.  Bishop,  Neural  Networks  for  Pattern  Recognition ,  Clarendon  Press, 
Oxford,  1995. 

[8]  C.  G.  E.  Boender  and  A.  H.  G.  Rinnooy  Kan,  “A  Multinomial  Bayesian  Ap¬ 
proach  to  the  Estimation  of  Population  and  Vocabulary  Size,”  Biometrika , 
vol.  74,  no.  4,  1987,  pp.  849-856. 

[9]  L.  J.  Buturovic,  “Toward  Bayes-Optimal  Linear  Dimension  Reduction,” 
IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence ,  vol.  16, 
no.  4,  April  1994,  pp.  420-423. 

[10]  L.  L.  Campbell,  “Averaging  Entropy,”  IEEE  Transactions  on  Information 
Theory,  vol.  41,  no.  1,  January  1995,  pp.  338-339. 


Ill 


[11]  B.  P.  Carlin  and  T.  A.  Louis,  Bayes  and  Empirical  Bayes  Methods  for  Data 
Analysis ,  Chapman  k  Hall,  London,  1996. 

[12]  G.  Casella  and  R.  L.  Berger,  Statistical  Inference ,  Duxbury  Press,  Belmont, 
California,  1990. 

[13]  B.  Chandrasekaran,  “Independence  of  Measurements  and  the  Mean  Recog¬ 
nition  Accuracy,”  IEEE  Transactions  on  Information  Theory ,  vol.  17,  July 
1971,  pp.  452-456. 

[14]  B.  Chandrasekaran  and  T.  J.  Harley,  Jr.,  “Comments  on  ’‘The  Mean  Accu¬ 
racy  of  Statistical  Pattern  Recognizers’,”  IEEE  Transactions  on  Information 
Theory ,  vol.  15,  May  1969,  pp.  421-423. 

[15]  M.  H.  DeGroot,  Probability  and  Statistics,  Addison- Wesley  Publishing  Com¬ 
pany,  Reading,  Massachusetts,  August  1987. 

[16]  M.  Delampady  and  J.  0.  Berger,  “Lower  Bounds  on  Bayes  Factors  for 
Multinomial  Distributions,  With  Applications  to  Chi-Square  Tests  of  Fit,” 
The  Annals  of  Statistics,  vol.  18,  no.  3,  1990,  pp.  1295-1316. 

[17]  H.  Demuth  and  M.  Beale,  Neural  Network  Toolbox,  The  Math  Works,  Inc., 
Natick,  MA,  1994. 

[18]  L.  Devroye,  L.  Gyorfi,  and  G.  Lugosi,  A  Probabilistic  Theory  of  Pattern 
Recognition,  Springer- Verlag,  New  York,  NY,  1996. 

[19]  P.  Dianconis  and  D.  Freedman,  “On  the  Uniform  Consistency  of  Bayes 
Estimates  for  Multinomial  Probabilities,”  The  Annals  of  Statistics,  vol.  18, 
no.  3,  1990,  pp.  1317-1327. 

[20]  R.  D.  Dony  and  S.  Haykin,  “Neural  Network  Approaches  to  Image  Compres¬ 
sion,”  Proceedings  of  the  IEEE,  vol.  83,  no.  2,  February  1995,  pp.  288-303. 

[21]  R.  P.  W.  Duin,  “The  Mean  Recognition  Performance  for  Independent  Dis¬ 
tributions,”  IEEE  Transactions  on  Information  Theory,  vol.  24,  no.  3,  April 
1978,  pp.  394-395. 

[22]  R.  P.  W.  Duin,  C.  E.  van  Haersma  Buma,  and  L.  Roosma,  “On  the  Evalu¬ 
ation  of  Independent  Binary  Features,”  IEEE  Transactions  on  Information 
Theory,  vol.  24,  no.  2,  March  1978,  pp.  248-249. 

[23]  R.  O.  Duda  and  P.  E.  Hart,  Pattern  Classification  and  Scene  Analysis,  John 
Wiley  k  Sons,  New  York,  NY,  1973. 

[24]  Y.  Ephraim,  “Statistical-Model-Based  Speech  Enhancement  Systems,”  Pro¬ 
ceedings  of  the  IEEE,  vol.  80,  no.  10,  October  1992,  pp.  1526-1555. 


112 


[25]  K.  Fukunaga,  Statistical  Pattern  Recognition ,  Academic  Press,  Inc.,  Boston, 
1990. 

[26]  K.  Fukunaga  and  R.  R.  Hayes,  “Effects  of  Sample  Size  in  Classifier  Design,” 
IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence ,  vol.  11, 
no.  8,  August  1989,  pp.  873-885. 

[27]  K.  Fukunaga  and  D.  Kessell,  “Nonparametric  Bayes  Error  Estimation  Using 
Unclassified  Samples,”  IEEE  Transactions  on  Information  Theory ,  vol.  19, 
1973,  pp.  434-440. 

[28]  I.  J.  Good,  “The  Bayes  Factor  Against  Equiprobability  of  a  Multinomial 
Population  Assuming  a  Symmetric  Dirichlet  Prior,”  The  Annals  of  Statis¬ 
tics ,  vol.  2,  no.  5,  1974,  pp.  977-987. 

[29]  I.  J.  Good  and  J.  F.  Crook,  “The  Robustness  and  Sensitivity  of  the  Mixed- 
Dirichlet  Bayesian  Test  for  “Independence”  in  Contingency  Tables,”  The 
Annals  of  Statistics,  vol.  15,  no.  2,  1987,  pp.  670-693. 

[30]  M.  Gutman,  “Asymptotically  Optimal  Classification  for  Multiple  Tests  with 
Empirically  Observed  Statistics,”  IEEE  Transactions  on  Information  The¬ 
ory,  vol.  35,  no.  2,  March  1989,  pp.  401-407. 

[31]  R.  Hanson,  J.  Stutz,  and  P.  Cheeseman,  “Bayesian  Classification  Theory,” 
NASA  Ames  Research  Center  Technical  Report,  no.  FIA-90-12-7-01,  Decem¬ 
ber  1990. 

[32]  J.  P.  Hoffbeck  and  D.  A.  Landgrebe,  “Covariance  Matrix  Estimation  and 
Classification  With  Limited  Training  Data,”  IEEE  Transactions  on  Pattern 
Analysis  and  Machine  Intelligence,  vol.  18,  no.  7,  July  1996,  pp.  763-767. 

[33]  R.  V.  Hogg  and  A.  T.  Craig,  Introduction  to  Mathematical  Statistics,  Pren¬ 
tice  Hall,  Inc.,  Englewood  Cliffs,  New  Jersey,  1995. 

[34]  G.  F.  Hughes,  “On  the  Mean  Accuracy  of  Statistical  Pattern  Recognizers,” 

IEEE  Transactions  on  Information  Theory,  vol.  14,  no.  1,  January  1968,  pp. 
55-63.  . 

[35]  Q.  Huo,  H.  Jiang,  and  C.  Lee,  “A  Bayesian  Predictive  Classification  Ap¬ 
proach  to  Robust  Speech  Recognition,”  Proceedings  of  the  IEEE  Interna¬ 
tional  Conference  on  Acoustics,  Speech,  and  Signal  Processing,  April  1997, 
pp.  1547-1550. 

[36]  A.  G.  Jaffer  and  S.  C.  Gupta,  “Coupled  Detection-Estimation  of  Gaussian 
Processes  in  Gaussian  Noise,”  IEEE  Transactions  on  Information  Theory, 
vol.  18,  no.  1,  January  1972,  pp.  106-110. 


113 


[37]  A.  Jain  and  D.  Zongker,  “Feature  Selection:  Evaluation,  Application,  and 
Small  Sample  Size  Performance,”  IEEE  Transactions  on  Pattern  Analysis 
and  Machine  Intelligence ,  vol.  19,  no.  2,  February  1997,  pp.  153-158. 

[38]  R.  E.  Kass  and  A.  E.  Raftery,  “Bayes  Factors,”  Journal  of  the  American 
Statistical  Association,  vol.  90,  no.  430,  June  1995,  pp.  773-795. 

[39]  D.  Kazakos,  “Quantization  Complexity  and  Training  Sample  Size  in  De¬ 
tection,”  IEEE  Transactions  on  Information  Theory ,  vol.  24,  no.  2,  March 
1978,  pp.  229-237. 

[40]  J.  Kittler,  “Feature  Set  Search  Algorithms,”  Pattern  Recognition  and  Signal 
Processing ,  C.  H.  Chen,  ed.,  SijthofF  and  Noordhoff,  Alphen  aan  den  Rijn, 
The  Netherlands,  1978,  pp.  41-60. 

[41]  G.  E.  Kokolakis,  “Bayesian  Classification  and  Classification  Performance 
for  Independent  Distributions,”  IEEE  Transactions  on  Information  Theory, 
vol.  27,  no.  4,  July  1981,  pp.  500-502. 

[42]  R.  E.  Krichevsky  and  V.  K.  Trofimov,  “The  Performance  of  Universal  En¬ 
coding,”  IEEE  Transactions  on  Information  Theory,  vol.  27,  no.  2,  March 
1981,  pp.  199-207. 

[43]  Q.  Li  and  D.  W.  Tufts,  “Principal  Feature  Classification,”  IEEE  Transac¬ 
tions  on  Neural  Networks,  vol.  8,  no.  1,  January  1997,  pp.  155-160. 

[44]  S.  R.  Kulkarni  and  O.  Zeitouni,  “A  General  Classification  Rule  for  Probabil¬ 
ity  Measures,”  The  Annals  of  Statistics,  vol.  23,  no.  4,  1995,  pp.  1393-1407. 

[45]  J.  J.  Martin,  Bayesian  Decision  Problems  and  Markov  Chains,  John  Wiley 

Sons,  Inc.,  New  York,  1967. 

[46]  N.  Merhav  and  Y.  Ephraim,  “A  Bayesian  classification  approach  with  ap¬ 
plication  to  speech  recognition,”  IEEE  Transactions  on  Acoustics,  Speech, 
and  Signal  Processing,  vol.  39,  no.  10,  October  1991,  pp.  2157-2166. 

[47]  N.  Merhav  and  J.  Ziv,  “Estimating  with  Partial  Statistics  the  Parameters  of 
Ergodic  Finite  Markov  Sources,”  IEEE  Transactions  on  Information  The¬ 
ory,  vol.  35,  no.  2,  March  1989,  pp.  326-334. 

[48]  N.  Merhav  and  J.  Ziv,  “A  Bayesian  Approach  for  Classification  of  Markov 
Sources,”  IEEE  Transactions  Information  Theory,  vol.  37,  no.  4,  July  1991, 
pp.  1067-1071. 

[49]  D.  Middleton  and  R.  Esposito,  “Simultaneous  Optimum  Detection  and  Es¬ 
timation  of  Signals  in  Noise,”  IEEE  Transactions  on  Information  Theory, 
vol.  14,  no.  3,  May  1968,  pp.  434-444. 


114 


[50]  G.  F.  Hughes,  “Variance  Comparisons  for  Unbiased  Estimators  of  Probabil¬ 
ity  of  Correct  Classification,”  IEEE  Transactions  on  Information  Theory , 
vol.  22,  1976,  pp.  102-105. 

[51]  J.  E.  Mosimann,  “On  the  Compound  Multinomial  Distribution,  the 
Multivariate  Beta-Distribution,  and  Correlation’s  Among  Proportions,” 
Biometrika,  vol.  49,  no.  1,  1962,  pp.  65-82. 

[52]  A.  Nadas,  “Optimal  Solution  of  a  Training  Problem  in  Speech  Recognition,” 
IEEE  Transactions  on  Acoustics,  Speech,  and  Signal  Processing ,  vol.  33,  no. 
1,  February  1985,  pp.  326-329. 

[53]  R.  E.  Neapolitan,  Probabilistic  Reasoning  in  Expert  Systems,  John  Wiley  & 
Sons,  Inc.,  New  York,  1990. 

[54]  K.  L.  Oehler  and  R.  M.  Gray,  “Combining  Image  Compression  and  Classifi¬ 
cation  Using  Vector  Quantization,”  IEEE  Transactions  on  Pattern  Analysis 
and  Machine  Intelligence,  vol.  17,  no.  5,  May  1995,  pp.  461-473. 

[55]  L.  I.  Pettit  and  K.  D.  S.  Young,  “Measuring  the  Effect  of  Observations  on 
Bayes  Factors,”  Biometrika ,  vol.  77,  no.  3,  1990,  pp.  455-466. 

[56]  M.  Raghavachari,  “Limiting  Distributions  of  Kolmogorov-Smirnov  Type 
Statistics  Under  the  Alternative,”  The  Annals  of  Statistics,  vol.  1,  no.  1, 
1973,  pp.  67-73. 

[57]  S.  J.  Raudys  and  A.  K.  Jain,  “Sample  Size  Effects  in  Statistical  Pattern 
Recognition:  Recommendations  for  Practitioners,”  IEEE  Transactions  on 
Pattern  Analysis  and  Machine  Intelligence ,  vol.  13,  no.  3,  March  1991,  pp. 
252-264. 

[58]  A.  Sankar,  L.  Neumeyer,  and  M.  Weintrub,  “An  Experimental  Study  of 
Acoustic  Adaptation  Algorithms,”  Proceedings  of  the  IEEE  International 
Conference  on  Acoustics,  Speech,  and  Signal  Processing ,  May  1996,  pp.  713- 
716. 

[59]  B.  M.  Shahshahani  and  D.  A.  Landgrebe,  “The  Effect  of  Unlabeled  Samples 
in  Reducing  the  Small  Sample  Size  Problem  and  Mitigating  the  Hughes 
Phenomenon,”  IEEE  Transactions  on  Geoscience  and  Remote  Sensing,  vol. 
32,  no.  5,  September  1904,  pp.  1087-1095. 

[60]  M.  Sobel  and  V.  R.  R.  Uppuluri,  “Sparse  and  Crowded  Cells  and  Dirichlet 
Distributions,”  The  Annals  of  Statistics,  vol.  2,  no.  5,  1974,  pp.  977-987. 

[61]  J.  M.  Van  Campenhout,  “On  the  Peaking  of  the  Hughes  Mean  Recognition 
Accuracy:  The  Resolution  of  an  Apparent  Paradox,”  IEEE  Transactions  on 
Systems,  Man,  and  Cybernetics,  vol.  8,  no.  5,  May  1978,  pp.  390-395. 


115 


[62]  W.  G.  Waller  and  A.  K.  Jain,  “On  the  Monotonicity  of  the  Performance  of 
Bayesian  Classifiers,”  IEEE  Transactions  on  Information  Theory,  vol.  24, 
no.  3,  pp.  392-394. 

[63]  H.  Wang,  D.  Bell,  and  F.  Murtagh,  “Axiomatic  Approach  to  Feature  Subset 
Selection  Based  on  Relevance,”  IEEE  Transactions  on  Pattern  Analysis  and 
Machine  Intelligence,  vol.  21,  no.  3,  March  1999,  pp.  271-277. 

[64]  A.  D.  Wyner  and  J.  Ziv,  “Classification  with  Finite  Memory,”  IEEE  Trans¬ 
actions  on  Information  Theory,  vol.  42,  no.  2,  March  1996,  pp.  337-347. 

[65]  Q.  Xie,  C.  A.  Laszlo,  and  R.  K.  Ward,  “Vector  Quantization  Technique  for 
Nonparametric  Classifier  Design,”  IEEE  Transactions  on  Pattern  Analysis 
and  Machine  Intelligence,  vol.  15,  no.  12,  December  1993,  pp.  1326-1329. 

[66]  X.  Yao  and  Y.  Liu,  “A  New  Evolutionary  System  for  Evolving  Artificial 
Neural  Networks,”  IEEE  Transactions  on  Neural  Networks,  vol.  8,  no.  3, 
May  1997,  pp.  694-713. 

[67]  J.  Ziv,  “On  Classification  with  Empirically  Observed  Statistics  and  Univer¬ 
sal  Data  Compression,”  IEEE  Transactions  on  Information  Theory,  vol.  34, 
no.  2,  March  1988,  pp.  278-286. 


116 


INITIAL  DISTRIBUTION  LIST 


Addressee                                   No. of Copies

Defense Technical Information Center        2


